Commit f0f02a9

Merge pull request #366 from morgo/mtocker-document-lock-wait-timeout
sort USAGE alphabetically, add lock-wait-timeout docs
2 parents: a04a2ee + f4764d1

1 file changed: +84 -77 lines changed

USAGE.md

Lines changed: 84 additions & 77 deletions
@@ -10,27 +10,29 @@ go build
 
 ## Configuration
 
-### host
+### alter
 
 - Type: String
-- Default value: `localhost:3306`
-- Examples: `mydbhost`, `mydbhost:3307`
+- Default value: `engine=innodb`
+- Examples: `add column foo int`, `add index foo (bar)`
 
-The host (and optional port) to use when connecting to MySQL.
+The alter table command to perform. The default value is a _null alter table_, which can be useful for testing.
 
-### username
+### checksum
 
-- Type: String
-- Default value: `msandbox`
+- Type: Boolean
+- Default value: TRUE
 
-The username to use when connecting to MySQL.
+When set to `TRUE`, Spirit will perform a checksum of the data in the table after the copy phase. This is a good way to ensure that the copy phase was successful, but it does add some overhead to the process. When you resume-from-checkpoint, Spirit will only run with the checksum enabled (regardless of your configuration). This is because it can not rely on duplicate-key errors to detect issues in the copy phase if the DDL included adding a new `UNIQUE` key.
 
-### password
+The checksum typically adds about 10-20% of additional time to the migration, but it is recommended to always leave it enabled. A failed checksum means that there is either:
+- A bug in Spirit
+- A bug in MySQL
+- Hardware errors
 
-- Type: String
-- Default value: `msandbox`
+Checksum failure is not fatal. Spirit will re-copy chunks that fail checksums automatically during the checksum process, and then re-run the checksum. If the checksum completes without error on a subsequent run then the entire checksum operation is successful. Three successive attempts to checksum where differences were found will result in Spirit exiting with an error.
 
-The password to use when connecting to MySQL.
+In testing, the checksum feature has identified corruption issues on desktops with non ECC memory. You may believe that this is what the InnoDB page checksums are for, but they are more specifically for detecting corruption introduced from the IO layer. Memory based corruption is not detected and remains common.
 
 ### database
 
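
Illustrative note on the `checksum` text in this hunk: the retry behaviour it describes (re-copy the chunks that differ, re-run the checksum, give up after three successive attempts that find differences) can be sketched roughly as below. This is a minimal Python sketch and not Spirit's Go implementation; `find_mismatched_chunks` and `recopy_chunk` are hypothetical callbacks standing in for the real checksum and copy steps.

```python
from typing import Callable, List

# From the USAGE text: three successive attempts that find differences are fatal.
MAX_CHECKSUM_ATTEMPTS = 3


def run_checksum(
    find_mismatched_chunks: Callable[[], List[int]],  # hypothetical: returns ids of chunks that differ
    recopy_chunk: Callable[[int], None],              # hypothetical: re-copies one chunk
    max_attempts: int = MAX_CHECKSUM_ATTEMPTS,
) -> int:
    """Return the attempt number on which the checksum passed, or raise."""
    for attempt in range(1, max_attempts + 1):
        mismatched = find_mismatched_chunks()
        if not mismatched:
            # A clean pass on any attempt means the whole checksum succeeded.
            return attempt
        # Differences are not fatal yet: repair them and checksum again.
        for chunk_id in mismatched:
            recopy_chunk(chunk_id)
    raise RuntimeError(
        f"checksum found differences on {max_attempts} successive attempts"
    )
```

A caller would treat the raised error as the point where the migration exits, matching the "exiting with an error" wording above.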

@@ -39,55 +41,19 @@ The password to use when connecting to MySQL.
 
 The database that the schema change will be performed in.
 
-### table
-
-- Type: String
-- Default value: `stock`
-
-The table that the schema change will be performed on.
-
-### alter
-
-- Type: String
-- Default value: `engine=innodb`
-- Examples: `add column foo int`, `add index foo (bar)`
-
-The alter table command to perform. The default value is a _null alter table_, which can be useful for testing.
-
-### threads
-
-- Type: Integer
-- Default value: `4`
-- Range: `1-64`
-
-Spirit uses `threads` to set the parallelism of:
-- The copier task
-- The checksum task
-- The replication applier task
-
-Internal to Spirit, the database pool size is set to `threads + 1`. This is intentional because the replication applier runs concurrently to the copier and checksum tasks, and using a shared-pool prevents the worst case of `threads * 2` being used. The tradeoff of `+1` allows the replication applier to always make some progress, while not bursting too far beyond the user's intended concurrency limit.
-
-You may want to wrap `threads` in automation and set it to a percentage of the cores of your database server. For example, if you have a 32-core machine you may choose to set this to `8`. Approximately 25% is a good starting point, making sure you always leave plenty of free cores for regular database operations. If your migration is IO bound and/or your IO latency is high (such as Aurora) you may even go higher than 25%.
-
-Note that Spirit does not support dynamically adjusting the number of threads while running, but it does support automatically resuming from a checkpoint if it is killed. This means that if you find that you've misjudged the number of threads (or [target-chunk-time](#target-chunk-time)), you can simply kill the Spirit process and start it again with different values.
+### defer-cutover
 
-### target-chunk-time
+The "defer cutover" feature makes spirit wait to perform the final cutover until a "sentinel" table has been dropped. This is similar to the --postpone-cut-over-flag-file feature of gh-ost.
 
-- Type: Duration
-- Default value: `500ms`
-- Range: `100ms-5s`
-- Typical safe values: `100ms-1s`
+The defer cutover feature will not be used and the sentinel table will not be created if the schema migration can be successfully executed using ALGORITHM=INSTANT (see "Attempt Instant DDL" in README.md).
 
-The target time for each copy or checksum operation. Note that the chunk size is specified as a _target time_ and not a _target rows_. This is helpful because rows can be inconsistent when you consider some tables may have a lot of columns or secondary indexes, or copy tasks may slow down as the workload becomes IO bound.
+If defer-cutover is true, Spirit will create a "sentinel" table in the same schema as the table being altered; the name of the sentinel table will use the pattern `_<table>_sentinel`. Spirit will block before the cutover, waiting for the operator to manually drop the sentinel table, which triggers Spirit to proceed with the cutover. Spirit will never delete the sentinel table on its own. It will block for 48 hours waiting for the sentinel table to be dropped by the operator, after which it will exit with an error.
 
-The target is not a hard limit, but rather a guideline which is recalculated based on a 90th percentile from the last 10 chunks that were copied. You should expect some outliers where the copy time is higher than the target. Outliers >5x the target will print to the log, and force an immediate reduction in how many rows are copied per chunk without waiting for the next recalculation.
+You can resume a migration from checkpoint and Spirit will start waiting again for you to drop the sentinel table. You can also choose to delete the sentinel table before restarting Spirit, which will cause it to resume from checkpoint and complete the cutover without waiting, even if you have again enabled defer-cutover for the migration.
 
-Larger values generally yield better performance, but have consequences:
-- A `5s` value means that at any point replicas will appear `5s` behind the source. Spirit does not support read-replicas, so we do not typically consider this a problem. See [replica-max-lag](#replica-max-lag) for more context.
-- Data locks (row locks) are held for the duration of each transaction, so even a `1s` chunk may lead to frustrating user experiences. Consider the scenario that a simple update query usually takes `<5ms`. If it tries to update a row that has just started being copied it will now take approximately `1.005s` to complete. In scenarios where there is a lot of contention around a few rows, this could even lead to a large backlog of queries waiting to be executed.
-- It is recommended to set the target chunk time to a value for which if queries increased by this much, user experience would still be acceptable even if a little frustrating. In some of our systems this means up to `2s`. We do not know of scenarios where values should ever exceed `5s`. If you can tolerate more unavailability, consider running DDL directly on the MySQL server.
+If you start a migration and realize that you forgot to set defer-cutover, worry not! You can manually create a sentinel table using the pattern `_<table>_sentinel`, and Spirit will detect the table before the cutover is completed and block as though defer-cutover had been enabled from the beginning.
 
-Note that Spirit does not support dynamically adjusting the target-chunk-time while running, but it does support automatically resuming from a checkpoint if it is killed. This means that if you find that you've misjudged the number of [threads](#threads) or target-chunk-time, you can simply kill the Spirit process and start it again with different values.
+Note that the checksum, if enabled, will be computed after the sentinel table is dropped. Because the checksum step takes an estimated 10-20% of the migration, the cutover will not occur immediately after the sentinel table is dropped.
 
 ### force-inplace
 
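
Illustrative note on the `defer-cutover` text in this hunk: the naming pattern `_<table>_sentinel` and the 48-hour wait can be sketched as below. This is a hedged Python sketch under stated assumptions, not Spirit's code; `sentinel_exists` is a hypothetical callback that would query the schema for the sentinel table, and the polling interval is an arbitrary illustrative value.

```python
import time
from typing import Callable

# From the USAGE text: Spirit blocks for 48 hours, then exits with an error.
SENTINEL_WAIT_LIMIT_SECONDS = 48 * 60 * 60


def sentinel_table_name(table: str) -> str:
    """Build the sentinel table name using the documented `_<table>_sentinel` pattern."""
    return f"_{table}_sentinel"


def wait_for_sentinel_drop(
    sentinel_exists: Callable[[], bool],       # hypothetical: True while the table still exists
    poll_interval_seconds: float = 1.0,        # illustrative value, not taken from the docs
    limit_seconds: float = SENTINEL_WAIT_LIMIT_SECONDS,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Block until the operator drops the sentinel table, or raise after the limit."""
    deadline = clock() + limit_seconds
    while sentinel_exists():
        if clock() >= deadline:
            raise TimeoutError("sentinel table was not dropped within the wait limit")
        sleep(poll_interval_seconds)
```

For the default table `stock`, `sentinel_table_name("stock")` gives `_stock_sentinel`, which is the table an operator would drop (or create by hand) to control the cutover.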

@@ -98,21 +64,29 @@ When set to `TRUE`, Spirit will attempt to perform the schema change using MySQL
 
 Even when force-inplace is `FALSE`, Spirit automatically detects "safe" operations that use the `INPLACE` algorithm. These include operations that modify only metadata, specifically `ALTER INDEX .. VISIBLE/INVISIBLE`, `DROP KEY/INDEX` and `RENAME KEY/INDEX`. Consult https://dev.mysql.com/doc/refman/8.0/en/innodb-online-ddl-operations.html for more details.
 
-### checksum
+### host
 
-- Type: Boolean
-- Default value: TRUE
+- Type: String
+- Default value: `localhost:3306`
+- Examples: `mydbhost`, `mydbhost:3307`
 
-When set to `TRUE`, Spirit will perform a checksum of the data in the table after the copy phase. This is a good way to ensure that the copy phase was successful, but it does add some overhead to the process. When you resume-from-checkpoint, Spirit will only run with the checksum enabled (regardless of your configuration). This is because it can not rely on duplicate-key errors to detect issues in the copy phase if the DDL included adding a new `UNIQUE` key.
+The host (and optional port) to use when connecting to MySQL.
 
-The checksum typically adds about 10-20% of additional time to the migration, but it is recommended to always leave it enabled. A failed checksum means that there is either:
-- A bug in Spirit
-- A bug in MySQL
-- Hardware errors
+## lock-wait-timeout
 
-Checksum failure is not fatal. Spirit will re-copy chunks that fail checksums automatically during the checksum process, and then re-run the checksum. If the checksum completes without error on a subsequent run then the entire checksum operation is successful. Three successive attempts to checksum where differences were found will result in Spirit exiting with an error.
+- Type: Duration
+- Default value: `30s`
 
-In testing, the checksum feature has identified corruption issues on desktops with non ECC memory. You may believe that this is what the InnoDB page checksums are for, but they are more specifically for detecting corruption introduced from the IO layer. Memory based corruption is not detected and remains common.
+Spirit requires an exclusive metadata lock for cutover and checksum operations. The MySQL default for waiting for a metadata lock is 1 year(!), which means that if there are any long running transactions holding a shared lock on the table that prevent the exclusive lock from being acquired, new lock requests will effectively queue forever behind Spirit's exclusive lock request. To prevent Spirit causing such outages, Spirit sets the `lock_wait_timeout` to 30s by default.
+
+If you are seeing cutover or checksum lock requests failing, you may consider increasing the `lock_wait_timeout`. However, it is almost always better to investigate why you have long running transactions that are preventing Spirit from acquiring the metadata lock. A good starting point is `select * from information_schema.INNODB_TRX`.
+
+### password
+
+- Type: String
+- Default value: `msandbox`
+
+The password to use when connecting to MySQL.
 
 ### replica-dsn
 
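
Illustrative note on the new `lock-wait-timeout` text in this hunk: once rows from `information_schema.INNODB_TRX` have been fetched, picking out the long-running transactions is a simple filter on the `trx_started` column. The Python sketch below assumes the rows were already loaded by some client as dictionaries with `trx_started` as a `datetime`; the 30-second threshold simply mirrors the documented default and is otherwise arbitrary. An old transaction is only a candidate: it does not necessarily hold a metadata lock on the table being altered.

```python
from datetime import datetime, timedelta
from typing import Any, Dict, List


def long_running_transactions(
    rows: List[Dict[str, Any]],                 # assumed: rows fetched from INNODB_TRX
    now: datetime,
    threshold: timedelta = timedelta(seconds=30),
) -> List[Dict[str, Any]]:
    """Return transactions open for longer than `threshold`, oldest first."""
    old = [row for row in rows if now - row["trx_started"] > threshold]
    return sorted(old, key=lambda row: row["trx_started"])
```

The `trx_mysql_thread_id` column of each returned row identifies the connection to investigate.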

@@ -138,27 +112,60 @@ The replication throttler only affects the copy-rows operation, and does not app
 - Temporarily disabling durability on the replica (i.e. `SET GLOBAL sync_binlog=0` and `SET GLOBAL innodb_flush_log_at_trx_commit=0`)
 - Increasing the `replica-max-lag` or disabling replica lag checking temporarily
 
-### defer-cutover
+### strict
 
-The "defer cutover" feature makes spirit wait to perform the final cutover until a "sentinel" table has been dropped. This is similar to the --postpone-cut-over-flag-file feature of gh-ost.
+- Type: Boolean
+- Default value: FALSE
 
-The defer cutover feature will not be used and the sentinel table will not be created if the schema migration can be successfully executed using ALGORITHM=INSTANT (see "Attempt Instant DDL" in README.md).
+By default, Spirit will automatically clean up these old checkpoints before starting the schema change. This allows schema changes to always be possible to proceed forward, at the risk of lost progress.
 
-If defer-cutover is true, Spirit will create a "sentinel" table in the same schema as the table being altered; the name of the sentinel table will use the pattern `_<table>_sentinel`. Spirit will block before the cutover, waiting for the operator to manually drop the sentinel table, which triggers Spirit to proceed with the cutover. Spirit will never delete the sentinel table on its own. It will block for 48 hours waiting for the sentinel table to be dropped by the operator, after which it will exit with an error.
+When set to `TRUE`, if Spirit encounters a checkpoint belonging to a previous migration, it will validate that the alter statement matches the `--alter` parameter. If the validation fails, spirit will exit and prevent the schema change process from proceeding.
 
-You can resume a migration from checkpoint and Spirit will start waiting again for you to drop the sentinel table. You can also choose to delete the sentinel table before restarting Spirit, which will cause it to resume from checkpoint and complete the cutover without waiting, even if you have again enabled defer-cutover for the migration.
+### table
 
-If you start a migration and realize that you forgot to set defer-cutover, worry not! You can manually create a sentinel table using the pattern `_<table>_sentinel`, and Spirit will detect the table before the cutover is completed and block as though defer-cutover had been enabled from the beginning.
+- Type: String
+- Default value: `stock`
 
-Note that the checksum, if enabled, will be computed after the sentinel table is dropped. Because the checksum step takes an estimated 10-20% of the migration, the cutover will not occur immediately after the sentinel table is dropped.
+The table that the schema change will be performed on.
 
-### strict
+### target-chunk-time
 
-- Type: Boolean
-- Default value: FALSE
+- Type: Duration
+- Default value: `500ms`
+- Range: `100ms-5s`
+- Typical safe values: `100ms-1s`
 
-By default, Spirit will automatically clean up these old checkpoints before starting the schema change. This allows schema changes to always be possible to proceed forward, at the risk of lost progress.
+The target time for each copy or checksum operation. Note that the chunk size is specified as a _target time_ and not a _target rows_. This is helpful because rows can be inconsistent when you consider some tables may have a lot of columns or secondary indexes, or copy tasks may slow down as the workload becomes IO bound.
 
-When set to `TRUE`, if Spirit encounters a checkpoint belonging to a previous migration, it will validate that the alter statement matches the `--alter` parameter. If the validation fails, spirit will exit and prevent the schema change process from proceeding.
+The target is not a hard limit, but rather a guideline which is recalculated based on a 90th percentile from the last 10 chunks that were copied. You should expect some outliers where the copy time is higher than the target. Outliers >5x the target will print to the log, and force an immediate reduction in how many rows are copied per chunk without waiting for the next recalculation.
+
+Larger values generally yield better performance, but have consequences:
+- A `5s` value means that at any point replicas will appear `5s` behind the source. Spirit does not support read-replicas, so we do not typically consider this a problem. See [replica-max-lag](#replica-max-lag) for more context.
+- Data locks (row locks) are held for the duration of each transaction, so even a `1s` chunk may lead to frustrating user experiences. Consider the scenario that a simple update query usually takes `<5ms`. If it tries to update a row that has just started being copied it will now take approximately `1.005s` to complete. In scenarios where there is a lot of contention around a few rows, this could even lead to a large backlog of queries waiting to be executed.
+- It is recommended to set the target chunk time to a value for which if queries increased by this much, user experience would still be acceptable even if a little frustrating. In some of our systems this means up to `2s`. We do not know of scenarios where values should ever exceed `5s`. If you can tolerate more unavailability, consider running DDL directly on the MySQL server.
+
+Note that Spirit does not support dynamically adjusting the target-chunk-time while running, but it does support automatically resuming from a checkpoint if it is killed. This means that if you find that you've misjudged the number of [threads](#threads) or target-chunk-time, you can simply kill the Spirit process and start it again with different values.
+
+### threads
 
+- Type: Integer
+- Default value: `4`
+- Range: `1-64`
+
+Spirit uses `threads` to set the parallelism of:
+- The copier task
+- The checksum task
+- The replication applier task
+
+Internal to Spirit, the database pool size is set to `threads + 1`. This is intentional because the replication applier runs concurrently to the copier and checksum tasks, and using a shared-pool prevents the worst case of `threads * 2` being used. The tradeoff of `+1` allows the replication applier to always make some progress, while not bursting too far beyond the user's intended concurrency limit.
+
+You may want to wrap `threads` in automation and set it to a percentage of the cores of your database server. For example, if you have a 32-core machine you may choose to set this to `8`. Approximately 25% is a good starting point, making sure you always leave plenty of free cores for regular database operations. If your migration is IO bound and/or your IO latency is high (such as Aurora) you may even go higher than 25%.
+
+Note that Spirit does not support dynamically adjusting the number of threads while running, but it does support automatically resuming from a checkpoint if it is killed. This means that if you find that you've misjudged the number of threads (or [target-chunk-time](#target-chunk-time)), you can simply kill the Spirit process and start it again with different values.
+
+### username
+
+- Type: String
+- Default value: `msandbox`
 
+The username to use when connecting to MySQL.
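
Illustrative note on the `threads` text in this hunk: the "approximately 25% of cores" guidance, the documented `1-64` range, and the `threads + 1` pool size can be combined in a small automation helper. This is a sketch under stated assumptions; the function names are hypothetical and the core count is assumed to come from your own inventory of the database server, not from the machine running Spirit.

```python
MIN_THREADS = 1   # lower bound of the documented range
MAX_THREADS = 64  # upper bound of the documented range


def suggested_threads(db_server_cores: int, fraction: float = 0.25) -> int:
    """Suggest a `threads` value as a fraction of the database server's cores,
    clamped to the documented 1-64 range."""
    raw = int(db_server_cores * fraction)
    return max(MIN_THREADS, min(MAX_THREADS, raw))


def pool_size(threads: int) -> int:
    """Connection pool size described in the USAGE text: `threads + 1`."""
    return threads + 1
```

For the 32-core example in the text, `suggested_threads(32)` yields `8` and `pool_size(8)` yields `9`. An IO-bound or high-latency setup could pass a larger `fraction`, as the text suggests.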
