
Conversation

@devanbenz

We were seeing a state in which a shard would not perform full compactions, leading to a buildup of level 4 TSM files.

File state:

```
-rw-r--r--.  1 root root 2.1G Aug  5 20:59 000016684-000000007.tsm
-rw-r--r--.  1 root root 2.1G Aug  5 21:02 000016684-000000008.tsm
-rw-r--r--.  1 root root 2.1G Aug  5 21:04 000016684-000000009.tsm
-rw-r--r--.  1 root root 376M Aug  5 21:05 000016684-000000010.tsm
-rw-r--r--.  1 root root 2.1G Aug  5 18:00 000016812-000000004.tsm
-rw-r--r--.  1 root root 1.4G Aug  5 18:00 000016812-000000005.tsm
-rw-r--r--.  1 root root 1.3G Aug  5 21:21 000016844-000000002.tsm
-rw-r--r--.  1 root root 2.1G Aug  5 18:00 000016948-000000004.tsm
-rw-r--r--.  1 root root 1.4G Aug  5 18:00 000016948-000000005.tsm
-rw-r--r--.  1 root root 2.1G Aug  5 18:00 000017076-000000004.tsm
```

There is a rogue level 2 file packed within fully compacted files

```
-rw-r--r--.  1 root root 2.1G Aug  5 20:59 000016684-000000007.tsm
-rw-r--r--.  1 root root 2.1G Aug  5 21:02 000016684-000000008.tsm
-rw-r--r--.  1 root root 2.1G Aug  5 21:04 000016684-000000009.tsm
-rw-r--r--.  1 root root 376M Aug  5 21:05 000016684-000000010.tsm
```

and level 4 files:

```
-rw-r--r--.  1 root root 2.1G Aug  5 18:00 000016948-000000004.tsm
-rw-r--r--.  1 root root 1.4G Aug  5 18:00 000016948-000000005.tsm
-rw-r--r--.  1 root root 2.1G Aug  5 18:00 000017076-000000004.tsm
```

The area of our code that causes this state to be skipped is here:

https://github.com/influxdata/influxdb/blob/22bec4f046a28e3f1fa815705362767151407e1b/tsdb/engine/tsm1/compact.go#L620-L670

```
// step is how many files to compact in a group. We want to clamp it at 4 but also still
// return groups smaller than 4.
step := 4
if step > end {
	step = end
}

// slice off the generations that we'll examine
generations = generations[start:end]

// Loop through the generations in groups of size step and see if we can compact all (or
// some of them as a group)
groups := []tsmGenerations{}
for i := 0; i < len(generations); i += step {
	var skipGroup bool
	startIndex := i

	for j := i; j < i+step && j < len(generations); j++ {
		gen := generations[j]
		lvl := gen.level()

		// Skip compacting this group if there happens to be any lower level files in the
		// middle. These will get picked up by the level compactors.
		if lvl <= 3 {
			skipGroup = true
			break
		}

		// Skip the file if it's over the max size and it contains a full block
		if gen.size() >= uint64(tsdb.MaxTSMFileSize) && gen.files[0].FirstBlockCount >= tsdb.DefaultMaxPointsPerBlock && !gen.hasTombstones() {
			startIndex++
			continue
		}
	}

	if skipGroup {
		continue
	}

	endIndex := i + step
	if endIndex > len(generations) {
		endIndex = len(generations)
	}

	if endIndex-startIndex > 0 {
		groups = append(groups, generations[startIndex:endIndex])
	}
}

if len(groups) == 0 {
	return nil, 0
}
```

We need to add some sort of escape mechanism that allows these compactions to occur, or simplify this logic.

This PR adds a test for the bugged file listing where we experience the issue. I'm still trying to determine how files get into this state, but this should resolve the issue if and when it occurs. I've added the check in PlanOptimize() so as not to change any of the core logic in Plan().
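For illustration, here is a minimal, self-contained sketch of the shape of that check; `group`, `level()`, and `Len()` below are simplified stand-ins for the planner's internal tsmGenerations type, not the actual implementation:

```
package main

import "fmt"

// group is a simplified stand-in for the planner's tsmGenerations type.
type group struct {
	levels []int // levels of the generations in this group
}

// level reports the level of the group; the real type derives this from
// its member generations.
func (g group) level() int { return g.levels[0] }

// Len reports how many generations the group holds.
func (g group) Len() int { return len(g.levels) }

func main() {
	// Groups as the optimize planner might see them: a lone level 2 group
	// wedged between fully compacted level 4+ groups.
	groups := []group{
		{levels: []int{4, 5, 4, 5}},
		{levels: []int{2}}, // the rogue low-level group
		{levels: []int{4, 5}},
	}

	for i, cur := range groups {
		if i+1 < len(groups) {
			next := groups[i+1]
			// Escape hatch: a group at a lower level than its successor, or a
			// group holding fewer than two generations, signals the stuck
			// state described above, so it should still be planned rather
			// than left for level compactors that never pick it up.
			if cur.level() < next.level() || cur.Len() < 2 {
				fmt.Printf("group %d (level %d) flagged for compaction\n", i, cur.level())
			}
		}
	}
}
```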

Closes #


  • I've read the contributing section of the project README.
  • Signed CLA (if not already signed).

@davidby-influx
Contributor

The test fails without the second condition check? || cur.Len() < 2

@devanbenz
Author

The test fails without the second condition check? || cur.Len() < 2

Correct. I should have pushed up a commit with just the test, but if you remove the check, it fails.

@davidby-influx
Contributor

davidby-influx commented Aug 30, 2025 via email

@devanbenz devanbenz marked this pull request as ready for review September 2, 2025 14:21
@devanbenz devanbenz requested review from davidby-influx and gwossum and removed request for davidby-influx September 2, 2025 14:21
@devanbenz devanbenz self-assigned this Sep 2, 2025
* 2 level 2 files
* 3 level 2 files
* 4 level 2 files
* 5 level 2 files

…k at the files on disk a lot of these l4 files are not at the max points per block, just at the max file size of 2.1 GB.

…we don't have nested lower level files in our plans
```
 // Skip compacting this group if there happens to be any lower level files in the
 // middle. These will get picked up by the level compactors.
-if lvl <= 3 {
+if lvl <= 3 && len(gen.files) > 4 {
```
@devanbenz
Author
Sep 2, 2025

I think it would be better to ensure that the files are nested within high-level files. For example, let's say we have the following levels listed:

5, 6, 7, 8, 4, 5, 2, 2, 4, 5, 4, 5

We can assume that the two level 2 files need to be compacted as a group with our high-level files.

But if we have

5, 6, 7, 8, 4, 5, 2, 2

since the two level 2 files are at the end of the entire list of files, we should not compact them with our higher-level groups.
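To make the nested-vs-trailing distinction concrete, here is a small self-contained sketch; the helper name `nestedLowLevelIndexes` is hypothetical, and level 4+ is treated as fully compacted:

```
package main

import "fmt"

// nestedLowLevelIndexes returns the indexes of low-level (< 4) generations
// that sit before some higher-level generation, i.e. the "nested" ones.
// Low-level generations after the last level 4+ generation are trailing
// and are left for the level compactors.
func nestedLowLevelIndexes(levels []int) []int {
	lastHigh := -1
	for i, lvl := range levels {
		if lvl >= 4 {
			lastHigh = i
		}
	}
	var nested []int
	for i, lvl := range levels {
		if lvl < 4 && i < lastHigh {
			nested = append(nested, i)
		}
	}
	return nested
}

func main() {
	// The two level 2 files are nested, so they compact with the group.
	fmt.Println(nestedLowLevelIndexes([]int{5, 6, 7, 8, 4, 5, 2, 2, 4, 5, 4, 5})) // [6 7]
	// The two level 2 files trail the list, so they are left alone.
	fmt.Println(nestedLowLevelIndexes([]int{5, 6, 7, 8, 4, 5, 2, 2})) // []
}
```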

I found the following test case:

```
data := []tsm1.FileStat{
	{
		Path: "01-04.tsm1",
		Size: 64 * 1024 * 1024,
	},
	{
		Path: "02-04.tsm1",
		Size: 64 * 1024 * 1024,
	},
	{
		Path: "03-04.tsm1",
		Size: 64 * 1024 * 1024,
	},
	{
		Path: "04-04.tsm1",
		Size: 64 * 1024 * 1024,
	},
	{
		Path: "05-02.tsm1",
		Size: 2 * 1024 * 1024,
	},
}
```

Which is found here:

```
func TestDefaultPlanner_Plan_CompactsMiddleSteps(t *testing.T) {
	data := []tsm1.FileStat{
		{
			Path: "01-04.tsm1",
			Size: 64 * 1024 * 1024,
		},
		{
			Path: "02-04.tsm1",
			Size: 64 * 1024 * 1024,
		},
		{
			Path: "03-04.tsm1",
			Size: 64 * 1024 * 1024,
		},
		{
			Path: "04-04.tsm1",
			Size: 64 * 1024 * 1024,
		},
		{
			Path: "05-02.tsm1",
			Size: 2 * 1024 * 1024,
		},
	}

	cp := tsm1.NewDefaultPlanner(
		newFakeFileStore(withFileStats(data)),
		tsdb.DefaultCompactFullWriteColdDuration,
	)

	expFiles := []tsm1.FileStat{data[0], data[1], data[2], data[3]}
	tsm, pLen := cp.Plan(time.Now())
	if exp, got := len(expFiles), len(tsm[0]); got != exp {
		t.Fatalf("tsm file length mismatch: got %v, exp %v", got, exp)
	} else if pLen != int64(len(tsm)) {
		t.Fatalf("tsm file plan length mismatch: got %v, exp %v", pLen, exp)
	}

	for i, p := range expFiles {
		if got, exp := tsm[0][i], p.Path; got != exp {
			t.Fatalf("tsm file mismatch: got %v, exp %v", got, exp)
		}
	}
}
```

This test does not pass with this change. I would expect the first four files to be compacted but the last one to be left out, since it's at the end of the file list.


Contributor

Why not pick up the level 2 files at the end? They will get the benefit of better compression from inclusion with higher-level files. Are you concerned it will be extra work on a hot shard?

Contributor

Retracting the above comment after discussion.

@davidby-influx
Contributor

@devanbenz - I think we need to create a new test case file. In it, we should have many, many permutations of compaction levels and generations. Here are a few examples:

4,5,4,5,4
2,4,5,4,5,4
3,2,4,5,4,5,4
3,2,4,5,4,5,4,2
2,2,3,2,4,5,4,5,4,2
2,2,3,2,4,5,2,4,5,4,2
2,2,3,2,4,5,2,2,4,5,3,4,2
4,5,2,2,4,5,3,5,4,2,2,2
4,5,2,2,4,5,3,5,2,3,4,2,2,2,5,4,5,6

We want to test:

  • leading low levels with runs from 1 to 9,
  • trailing low levels,
  • nested low levels (with varying run lengths from 1 to 9),
  • multiple nested low levels with varying run lengths,
  • leading low levels and multiple nested low levels with varying run lengths,
  • leading low levels and multiple nested low levels with varying run lengths and trailing low levels,
  • etc.

Create compact_case_test.go, generate all the test cases (maybe statically with Claude, maybe programmatically at run time) with labels (don't forget to vary points per block and file size). This will be a large set of permutations, so test them as you go along with whatever algorithm choices you have made to the planner.
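As a rough illustration, a table-driven case in compact_case_test.go might be shaped like this; the field names, case labels, and expected groups below are hypothetical, not the planner's actual output:

```
package tsm1_test

// plannerCase labels one permutation of generation levels plus the block
// count and file size it should be planned under.
type plannerCase struct {
	name           string
	levels         []int   // one entry per generation, oldest first
	pointsPerBlock int     // vary across cases
	fileSize       uint64  // vary across cases
	expectedGroups [][]int // generation indexes expected in each group
}

var plannerCases = []plannerCase{
	{
		name:           "nested low level run",
		levels:         []int{3, 2, 4, 5, 4, 5, 4},
		pointsPerBlock: 1000,
		fileSize:       64 * 1024 * 1024,
		expectedGroups: [][]int{{0, 1, 2, 3, 4, 5, 6}},
	},
	{
		name:           "trailing low levels stay out",
		levels:         []int{3, 2, 4, 5, 4, 5, 4, 2},
		pointsPerBlock: 1000,
		fileSize:       64 * 1024 * 1024,
		expectedGroups: [][]int{{0, 1, 2, 3, 4, 5, 6}},
	},
	// ... generated permutations: leading/trailing/nested low-level runs of
	// length 1-9, multiply nested runs, and varied block counts and sizes.
}
```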

Claude generated the test cases, and I adjusted the expected results to what I believe they should be.
@devanbenz devanbenz marked this pull request as draft September 3, 2025 16:34
@devanbenz
Author

@devanbenz - I think we need to create a new test case file. In it, we should have many, many permutations of compaction levels and generations. […] Create compact_case_test.go, generate all the test cases (maybe statically with Claude, maybe programmatically at run time) with labels (don't forget to vary points per block and file size). […]

This morning, while working on this, I've found the static test file approach to work better. I'm going to duplicate the same tests with different block counts to see how that impacts the planner and the expected results.


```
var planner CompactionPlanner = NewDefaultPlanner(fs, time.Duration(opt.Config.CompactFullWriteColdDuration))
planner.SetAggressiveCompactionPointsPerBlock(int(opt.Config.AggressivePointsPerBlock))
planner.SetNestedCompactor(opt.Config.NestedCompactorEnabled)
```
Contributor

Perhaps log here if the config is set to true. I think that will be a helpful confirmation. Sometimes we get logs but not a config file (although with Cloud1 we have direct access to both).

Contributor

Absolutely log. Info level. Great suggestion.

"000016684-000000007.tsm",
"000016812-000000004.tsm",
"000016853-000000002.tsm",
"000016948-000000004.tsm",
@philjb
Contributor
Sep 5, 2025

I think this is wrong. This compaction group will merge these tsm files but that will produce incorrect overwrite semantics as the files in the group aren't adjacent.

The level 2 group looks right: it pulled 4 sequential level 2 files together into a group. They will be compacted into a level 3 generation that will be correct for overwrites, producing a 000016844-3.

But if the weighting of the scheduler picks this level 4 group to be run, it'll compact 684-7, 812-5, 853-2, and 948-4 together and produce a 000016684-8, which will be older than the one from the level 2 group (16684 < 16844), and overwrites will be broken.

So you might need to check that there isn't a run of non-fully-compacted generations longer than 4 (which is what the level 2 & 3 compactions look for).

```
if i-lastHighLevelIndex > 3 {
	break
}
```

@philjb
Contributor
philjb left a comment

I found an issue in one of the tests; it makes me wonder whether other tests are incorrectly combining TSM files that aren't adjacent. I think each test case could have a check that, given the input files/generations and the expected output, all generations in all groups are adjacent, just to verify the test is accurate too.
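A sketch of what that per-case adjacency assertion could look like, with groups expressed as generation indexes (the helper name is hypothetical):

```
package tsm1_test

import "testing"

// assertAdjacentGroups fails the test if any expected group covers a
// non-contiguous run of generation indexes.
func assertAdjacentGroups(t *testing.T, groups [][]int) {
	t.Helper()
	for _, g := range groups {
		for i := 1; i < len(g); i++ {
			if g[i] != g[i-1]+1 {
				t.Fatalf("group %v is not adjacent at index %d", g, i)
			}
		}
	}
}
```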

…iles but it's less than 4 len AND we have multiple level < 4 files that can be compacted in full, i.e. nested lower level files. We will compact them all together if that's the case. Please see our test compact_case_static_test.go / "Mixed generations with 4 level 2 files".
```
for i, cur := range groups {
	if len(groups) > i+1 {
		nextLevel := groups[i+1]
		if cur.level() < nextLevel.level() {
```
@philjb
Contributor
Sep 8, 2025

I don't think this is handling the groups as we want.

Consider these generations:

old 4, 4, 4, 2, 2, 2, 2, 1, 1, 1, 1, 4, 4, 4, 3, 3, 3, 2, 2, 1, 1, 1, 1, 1, 1 young

They will group into:

old G4, G2, G1, G4', G3, G2', G1 young

This conditional will "skip/continue" G1 but not G2, even though both G1 and G2 are nested within L4 generations.

Contributor

I think you need to scan ahead. I don't think you can "only" look at the level of an immediately subsequent group.

Perhaps this code:

```
// Remove any groups in the wrong level
var levelGroups []tsmGenerations
if c.enableNestedCompactor {
	// search for runs of groups nested between level 4+ groups
	// e.g. G4, G4, G1, G4, G2 ... = G1
	// e.g. G4, G4, G2, G4, G1, G4 ... = (G2), (G1)
	// e.g. G4, G4, G2, G1, G3, G1, G4, G1, G4 ... = (G2, G1, G3, G1), (G1)
	skip := make([]bool, len(groups))
	lastG4 := 0
	for i, cur := range groups {
		lvl := cur.level()
		if lvl >= 4 {
			if i > 0 && lastG4 != i-1 {
				// we have returned to L4+ after seeing L3-
				// skip all the groups in between
				for j := lastG4 + 1; j < i; j++ {
					skip[j] = true
				}
			}
			lastG4 = i
		}

		if lvl != level {
			// normal (non-nested) behavior: group's level doesn't match the requested planning level
			skip[i] = true
		}
	}

	// process skip list
	for i, cur := range groups {
		if !skip[i] {
			levelGroups = append(levelGroups, cur)
		}
	}
} else {
	for _, cur := range groups {
		if cur.level() == level {
			levelGroups = append(levelGroups, cur)
		}
	}
}
```

The most important thing is that we don't change the ordering of the input; this will prevent that, since we check the index anyway 🤔
This commit adds a check to ensure that we do not have non-adjacent files in our l4 and l5 planning. Since both can plan over, and be comprised of, similar files, we need to ensure that all files in all groups are adjacent.

For example, let's say we have the following generations

1_2_3_4_5_6_7_8

our l4 plan picks up 3_4_5_6
our l5 plan picks up 1_2_7_8

There is now a non-adjacent run within l5 while the middle is picked up by l4:

l5: 1_2_x_x_x_x_7_8
l4: x_x_3_4_5_6_x_x

With this change, if this is encountered we will instead get

l5: 1_2_x_x_x_x_x_x
l4: x_x_3_4_5_6_7_8

satisfying our adjacent-file policy.
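A minimal sketch of that repair, treating each plan as a slice of generation indexes (the function name and representation are hypothetical):

```
package main

import "fmt"

// repairAdjacency hands the run of l5 generations that follows a gap over to
// the l4 plan, so that both plans end up covering contiguous ranges.
func repairAdjacency(l5, l4 []int) (newL5, newL4 []int) {
	for i := 1; i < len(l5); i++ {
		if l5[i] != l5[i-1]+1 {
			// l5 is split around l4; everything from the gap onward joins l4.
			return l5[:i], append(l4, l5[i:]...)
		}
	}
	return l5, l4
}

func main() {
	l5, l4 := repairAdjacency([]int{1, 2, 7, 8}, []int{3, 4, 5, 6})
	fmt.Println(l5, l4) // [1 2] [3 4 5 6 7 8]
}
```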
@devanbenz
Author

Closing this PR, as @davidby-influx's implementation is in master-1.x.

@devanbenz devanbenz closed this Oct 31, 2025
