Fix: add lemma immunity in WordCoherencyRule to prevent false positives for inflected forms #11568

minizhiren · 2025-10-15T00:27:35Z

🧩 Summary

This PR fixes a false positive in the English Word Coherency Rule where certain inflected forms of words triggered unwanted "mixed spelling variant" warnings.

🐛 Problem

The rule incorrectly flagged these words:

doggies
doggier
doggiest

as inconsistent with the base forms doggy / doggie.
During testing (EnglishTest.testLanguage), this caused an assertion failure in the coherency.txt consistency check.

⚙️ Root Cause

The rule compared surface tokens directly against the variant map
without considering that an inflected word’s lemma (its base form)
might itself be one of the allowed variants.

As a result, legitimate inflections (e.g., “doggies”) were incorrectly treated as mixed variant usages.

🔧 Fix

Added a lemma-based immunity check in WordCoherencyRule to skip reporting
when a token’s lemma belongs to the same variant set defined in coherency.txt.

For example, if doggy and doggie are coherent variants,
then inflected forms whose lemma is one of them (doggies, doggier, etc.) no longer trigger false alarms.

if (!Collections.disjoint(lemmas, variants)) {
  // a lemma is itself one of the coherent variants, so the token is an inflected form: skip
  continue;
}
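To make the check above easier to follow in isolation, here is a minimal, self-contained sketch. It does not reproduce the actual rule code: the class name `LemmaImmunitySketch` and the helper `isImmune` are hypothetical, and the lemma and variant sets are plain `Set<String>` values rather than LanguageTool types.

```java
import java.util.Collections;
import java.util.Set;

// Hypothetical illustration only; not the code in WordCoherencyRule.
final class LemmaImmunitySketch {

  // True when at least one lemma of the token is itself a member of the
  // variant set, i.e. the token is an inflected form of an allowed variant.
  static boolean isImmune(Set<String> lemmas, Set<String> variants) {
    return !Collections.disjoint(lemmas, variants);
  }

  public static void main(String[] args) {
    // Assumed variant pair, mirroring the doggy/doggie example.
    Set<String> variants = Set.of("doggy", "doggie");
    System.out.println(isImmune(Set.of("doggy"), variants)); // lemma of "doggies"
    System.out.println(isImmune(Set.of("cat"), variants));   // unrelated lemma
  }
}
```

`Collections.disjoint` returns true when the two collections share no element, so negating it asks "does any lemma appear among the variants?".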

🧠 Additional Notes

I also checked another repository (english-pos-dict)
and found that there are multiple word pairs similar to doggie/doggy
that could potentially cause the same issue.
However, only doggie was actually included in coherency.txt.

✅ Scope

Modified file: languagetool-language-modules/en/WordCoherencyRule.java

No core or cross-language code was changed.

This fix only affects the English module.

🧪 How to Reproduce

mvn edu.illinois:nondex-maven-plugin:2.2.1:nondex -Dtest=org.languagetool.rules.en.EnglishTest#testLanguage -DnondexMode=ONE -DnondexSeed=933178 -Dsurefire.failIfNoSpecifiedTests=false

Summary by CodeRabbit

New Features
- Improved detection of inconsistent word variants across sentences with cross-sentence consistency checks and actionable replacement suggestions.
Bug Fixes
- Reduced false positives in variant checks for more accurate alerts.
Improvements
- Enhanced messaging and smarter replacement suggestions to make corrections faster and clearer.


coderabbitai · 2025-10-15T00:28:13Z

Walkthrough

Adds a public match(List<AnalyzedSentence>) to WordCoherencyRule performing two-phase, per-sentence token/lemma analysis to detect opposite/variant word forms, build RuleMatch instances with messages and replacements, and log debug output; extends constructor initialization logging.

Changes

| Cohort / File(s) | Summary of changes |
| --- | --- |
| Word coherency multi-sentence matching `languagetool-language-modules/en/src/main/java/org/languagetool/rules/en/WordCoherencyRule.java` | Added `public RuleMatch[] match(List<AnalyzedSentence>)` implementing a two-pass per-sentence analysis to collect lemmas/tokens, compute candidate keys, look up variant/opposite sets, filter inflections and past-tense fallbacks, generate `RuleMatch` instances with messages and suggested replacements, manage lemma immunity/position tracking, and emit debug logging; also added constructor debug logging when loading word data. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 18.18% which is insufficient. The required threshold is 80.00%. | You can run `@coderabbitai generate docstrings` to improve docstring coverage. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title Check | ✅ Passed | The title clearly summarizes the primary change by stating that lemma immunity is added to WordCoherencyRule to prevent false positives, directly reflecting the PR’s objective and naming the affected component without extraneous detail. |

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 504c0cb and 77ea1d5.

📒 Files selected for processing (1)

languagetool-language-modules/en/src/main/java/org/languagetool/rules/en/WordCoherencyRule.java (3 hunks)

🧰 Additional context used

🧬 Code graph analysis (1)

languagetool-language-modules/en/src/main/java/org/languagetool/rules/en/WordCoherencyRule.java (2)

languagetool-core/src/main/java/org/languagetool/AnalyzedToken.java (1)

AnalyzedToken (33-154)

languagetool-core/src/main/java/org/languagetool/rules/RuleMatch.java (1)

RuleMatch (43-739)

🔇 Additional comments (11)

languagetool-language-modules/en/src/main/java/org/languagetool/rules/en/WordCoherencyRule.java (11)
4-19: LGTM!

All imports are necessary for the new implementation and properly organized.

24-29: LGTM!

The debug setup is standard and appropriate for troubleshooting purposes.

35-40: LGTM!

The example pair clearly demonstrates the rule's purpose, and the debug logging is helpful for verification.

42-51: LGTM!

The method initialization and sentence loop structure are appropriate for the coherency detection logic.

52-77: LGTM! This is the core fix for the false positive issue.

The lemma-based immunity check at line 74 correctly prevents inflected forms (like "doggies", "doggier", "doggiest") from being flagged when their base form is in the variant map. The token and lemma extraction logic is sound.

112-119: LGTM!

The opposite registration logic correctly tracks variant groups across the document, enabling coherency detection for subsequent tokens.

147-166: LGTM!

The candidate key generation logic is well-structured, with appropriate fallback to past tense variants when no direct hits are found. Using LinkedHashSet preserves order while avoiding duplicates.
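The ordering and de-duplication behaviour mentioned in this comment can be sketched as follows. This is an illustration under assumptions, not the PR's code: `CandidateKeySketch` and `candidateKeys` are hypothetical names, lemmas are passed as plain strings, and the past-tense fallback is left out.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; not the code in WordCoherencyRule.
final class CandidateKeySketch {

  // Collects lookup keys in a stable order: the surface token first, then
  // each lemma. LinkedHashSet drops duplicates but keeps insertion order.
  static Set<String> candidateKeys(String token, List<String> lemmas) {
    Set<String> keys = new LinkedHashSet<>();
    keys.add(token.toLowerCase(Locale.ROOT));
    for (String lemma : lemmas) {
      if (lemma != null) {
        keys.add(lemma.toLowerCase(Locale.ROOT));
      }
    }
    return keys;
  }

  public static void main(String[] args) {
    // "doggies" appears both as token and as a lemma; it is kept once.
    System.out.println(candidateKeys("Doggies", List.of("doggy", "doggies")));
  }
}
```

A plain `HashSet` would also remove duplicates, but its iteration order is unspecified, which matters when the first matching key decides the outcome.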

169-189: LGTM!

The past tense fallback logic is conservative and safe, with proper boundary checks for all string operations. The handling of hyphenated forms is a nice touch.
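For readers without the diff at hand, a conservative past-tense fallback with boundary checks might look roughly like the sketch below. The exact rules in the PR are not shown on this page, so everything here is an assumption: the names `PastTenseFallbackSketch` and `pastTenseCandidates`, the minimum-length threshold, and the choice to inflect only the last hyphen-separated segment.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the code in WordCoherencyRule.
final class PastTenseFallbackSketch {

  // Returns possible base forms for a token ending in "ed", or an empty
  // list when the token is too short or has a different ending.
  static List<String> pastTenseCandidates(String token) {
    List<String> out = new ArrayList<>();
    // For hyphenated tokens, treat only the last segment as inflected.
    int hyphen = token.lastIndexOf('-');
    String prefix = hyphen >= 0 ? token.substring(0, hyphen + 1) : "";
    String last = hyphen >= 0 ? token.substring(hyphen + 1) : token;
    // Boundary check: require at least two characters before the suffix,
    // so every substring call below stays within range.
    if (last.length() > 3 && last.endsWith("ed")) {
      out.add(prefix + last.substring(0, last.length() - 2)); // strip "ed"
      out.add(prefix + last.substring(0, last.length() - 1)); // strip "d"
    }
    return out;
  }

  public static void main(String[] args) {
    System.out.println(pastTenseCandidates("e-mailed"));
    System.out.println(pastTenseCandidates("red"));
  }
}
```

The candidates are only guesses at a base form; they would still need to be looked up in the variant map before being used.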

192-222: LGTM! This is the core immunity logic that prevents false positives.

The inflection detection is appropriately conservative, whitelisting only patterns unlikely to be verb forms (y-stem plurals/comparatives, generic -er/-est). The explicit exclusion of -s/-es/-ed/-ing prevents masking genuine verb form inconsistencies. This correctly addresses the issue described in the PR for words like "doggy"/"doggie" and their inflections.
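The whitelist described in this comment can be illustrated with a small sketch. It follows only the patterns named above (y-stem plurals and comparatives, generic -er/-est) and is not taken from the PR; `InflectionWhitelistSketch` and `isWhitelistedInflection` are hypothetical names.

```java
import java.util.Locale;

// Hypothetical illustration only; not the code in WordCoherencyRule.
final class InflectionWhitelistSketch {

  // True only for inflection patterns that are unlikely to be verb forms.
  // Plain -s/-es/-ed/-ing endings are deliberately NOT accepted here.
  static boolean isWhitelistedInflection(String token, String lemma) {
    String t = token.toLowerCase(Locale.ROOT);
    String l = lemma.toLowerCase(Locale.ROOT);
    if (t.equals(l)) {
      return false; // same form, not an inflection
    }
    // y-stem: doggy -> doggies / doggier / doggiest
    if (l.length() > 1 && l.endsWith("y")) {
      String stem = l.substring(0, l.length() - 1);
      if (t.equals(stem + "ies") || t.equals(stem + "ier") || t.equals(stem + "iest")) {
        return true;
      }
    }
    // generic comparative/superlative: small -> smaller / smallest
    return t.equals(l + "er") || t.equals(l + "est");
  }

  public static void main(String[] args) {
    System.out.println(isWhitelistedInflection("doggies", "doggy"));
    System.out.println(isWhitelistedInflection("walked", "walk"));
  }
}
```

Keeping verb endings out of the whitelist means a genuine inconsistency between verb variants is still reported instead of being silently skipped.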

102-102: No action needed: createReplacement is inherited from AbstractWordCoherencyRule and defined with the correct signature.

95-111: Fix: Set short message before adding RuleMatch to the list.

Line 110 calls setShortMessage() after the RuleMatch has already been added to ruleMatches on line 108. This ordering issue means the RuleMatch in the list won't have the short message set at the time it's added.

Apply this diff to fix the ordering:
           }
           if (!marked.equalsIgnoreCase(replacement)) {
             rm.setSuggestedReplacement(replacement);
+            rm.setShortMessage(getShortMessage());
             ruleMatches.add(rm);
           }
-          rm.setShortMessage(getShortMessage());
           break;
Likely an incorrect or invalid review comment.
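One plausible reason for the "likely invalid" verdict is that a Java list stores references, so a setter called after `add()` is still visible through the list. The sketch below shows that behaviour with a hypothetical `Match` class standing in for `RuleMatch`; it makes no claim about how `RuleMatch` itself is implemented.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; Match is a stand-in, not RuleMatch.
final class AddThenMutateSketch {

  static final class Match {
    private String shortMessage;
    void setShortMessage(String msg) { this.shortMessage = msg; }
    String getShortMessage() { return shortMessage; }
  }

  // Adds the object first, mutates it afterwards, then reads it back
  // through the list.
  static String shortMessageSeenThroughList() {
    List<Match> matches = new ArrayList<>();
    Match m = new Match();
    matches.add(m);                      // the list holds a reference, not a copy
    m.setShortMessage("Mixed variants"); // mutation happens after add()
    return matches.get(0).getShortMessage();
  }

  public static void main(String[] args) {
    System.out.println(shortMessageSeenThroughList());
  }
}
```

Under that reading, the suggested reordering would be a readability change rather than a behavioural fix.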

minizhiren · 2025-10-15T01:39:22Z

@coderabbitai generate docstrings

coderabbitai · 2025-10-15T01:39:59Z

✅ Actions performed

Initiated docstring generation; will generate only if new commits exist.

coderabbitai · 2025-10-15T01:40:02Z

Note

Docstrings generation - SUCCESS
Generated docstrings for this pull request at #11570

@minizhiren

Docstrings generation was requested by @minizhiren. * #11568 (comment) The following files were modified: * `languagetool-language-modules/en/src/main/java/org/languagetool/rules/en/WordCoherencyRule.java`

Fix: add lemma immunity in WordCoherencyRule to prevent false positives for inflected forms

504c0cb

coderabbitai bot mentioned this pull request Oct 15, 2025

📝 Add docstrings to fix-wordcoherencyrule #11570

Open

Refine WordCoherencyRule: safer fallback, cleanup, improved inflection handling

77ea1d5
