@minizhiren
Contributor

@minizhiren minizhiren commented Oct 15, 2025

🧩 Summary

This PR fixes a false positive in the English Word Coherency Rule where certain inflected forms of words triggered unwanted "mixed spelling variant" warnings.

🐛 Problem

The rule incorrectly flagged these words:

doggies
doggier
doggiest

as inconsistent with the base forms doggy / doggie.
During testing (EnglishTest.testLanguage), this caused an assertion failure in the coherency.txt consistency check.

⚙️ Root Cause

The rule compared surface tokens directly against the variant map
without considering that an inflected word’s lemma (its base form)
might itself be one of the allowed variants.

As a result, legitimate inflections (e.g., “doggies”) were incorrectly treated as mixed variant usages.

🔧 Fix

Added a lemma-based immunity check in WordCoherencyRule to skip reporting
when a token’s lemma belongs to the same variant set defined in coherency.txt.

For example, if doggy and doggie are coherent variants,
then inflected forms whose lemma is one of those variants (doggies, doggier, etc.) no longer trigger false alarms.

if (!Collections.disjoint(lemmas, variants)) {
  // lemma itself is one of the coherent variants → inflected form → skip
  continue;
}
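
To make the check above concrete, here is a minimal, self-contained sketch of the same idea outside the rule class. The helper name isImmune, the class name, and the sample lemma sets are assumptions for illustration only; in the actual rule the lemmas come from the token's analysis and the variants from coherency.txt.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

class LemmaImmunitySketch {

  // Hypothetical helper: a token is "immune" when at least one of its lemmas
  // is itself a member of the coherent variant set.
  static boolean isImmune(Set<String> lemmas, Set<String> variants) {
    // Collections.disjoint returns true when the two collections share no element
    return !Collections.disjoint(lemmas, variants);
  }

  public static void main(String[] args) {
    // Assumed variant set, mirroring the doggy/doggie pair from the description
    Set<String> variants = new HashSet<>(Arrays.asList("doggy", "doggie"));

    // Assumed lemma sets for two sample tokens
    Set<String> lemmasOfDoggies = Collections.singleton("doggie");
    Set<String> lemmasOfCats = Collections.singleton("cat");

    System.out.println(isImmune(lemmasOfDoggies, variants)); // true
    System.out.println(isImmune(lemmasOfCats, variants));    // false
  }
}
```

The sketch only covers the set-intersection test; it does not model how the rule tracks which variant was seen first in the document.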

🧠 Additional Notes

I also checked another repository (english-pos-dict)
and found that there are multiple word pairs similar to doggie/doggy
that could potentially cause the same issue.
However, only doggie was actually included in coherency.txt.

✅ Scope

Modified file: languagetool-language-modules/en/WordCoherencyRule.java

No core or cross-language code was changed.

This fix only affects the English module.

🧪 How to Reproduce

mvn edu.illinois:nondex-maven-plugin:2.2.1:nondex -Dtest=org.languagetool.rules.en.EnglishTest#testLanguage -DnondexMode=ONE -DnondexSeed=933178 -Dsurefire.failIfNoSpecifiedTests=false

Summary by CodeRabbit

  • New Features

    • Improved detection of inconsistent word variants across sentences with cross-sentence consistency checks and actionable replacement suggestions.
  • Bug Fixes

    • Reduced false positives in variant checks for more accurate alerts.
  • Improvements

    • Enhanced messaging and smarter replacement suggestions to make corrections faster and clearer.

@coderabbitai
Contributor

coderabbitai bot commented Oct 15, 2025

Walkthrough

Adds a public match(List<AnalyzedSentence>) to WordCoherencyRule performing two-phase, per-sentence token/lemma analysis to detect opposite/variant word forms, build RuleMatch instances with messages and replacements, and log debug output; extends constructor initialization logging.

Changes

| Cohort / File(s) | Summary of changes |
| --- | --- |
| Word coherency multi-sentence matching: languagetool-language-modules/en/src/main/java/org/languagetool/rules/en/WordCoherencyRule.java | Added public RuleMatch[] match(List<AnalyzedSentence>) implementing a two-pass per-sentence analysis to collect lemmas/tokens, compute candidate keys, look up variant/opposite sets, filter inflections and past-tense fallbacks, generate RuleMatch instances with messages and suggested replacements, manage lemma immunity/position tracking, and emit debug logging; also added constructor debug logging when loading word data. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 18.18% which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title Check | ✅ Passed | The title clearly summarizes the primary change by stating that lemma immunity is added to WordCoherencyRule to prevent false positives, directly reflecting the PR’s objective and naming the affected component without extraneous detail. |

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 504c0cb and 77ea1d5.

📒 Files selected for processing (1)
  • languagetool-language-modules/en/src/main/java/org/languagetool/rules/en/WordCoherencyRule.java (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
languagetool-language-modules/en/src/main/java/org/languagetool/rules/en/WordCoherencyRule.java (2)
languagetool-core/src/main/java/org/languagetool/AnalyzedToken.java (1)
  • AnalyzedToken (33-154)
languagetool-core/src/main/java/org/languagetool/rules/RuleMatch.java (1)
  • RuleMatch (43-739)
🔇 Additional comments (11)
languagetool-language-modules/en/src/main/java/org/languagetool/rules/en/WordCoherencyRule.java (11)

4-19: LGTM!

All imports are necessary for the new implementation and properly organized.


24-29: LGTM!

The debug setup is standard and appropriate for troubleshooting purposes.


35-40: LGTM!

The example pair clearly demonstrates the rule's purpose, and the debug logging is helpful for verification.


42-51: LGTM!

The method initialization and sentence loop structure are appropriate for the coherency detection logic.


52-77: LGTM! This is the core fix for the false positive issue.

The lemma-based immunity check at line 74 correctly prevents inflected forms (like "doggies", "doggier", "doggiest") from being flagged when their base form is in the variant map. The token and lemma extraction logic is sound.


112-119: LGTM!

The opposite registration logic correctly tracks variant groups across the document, enabling coherency detection for subsequent tokens.


147-166: LGTM!

The candidate key generation logic is well-structured, with appropriate fallback to past tense variants when no direct hits are found. Using LinkedHashSet preserves order while avoiding duplicates.
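
As a rough illustration of why a LinkedHashSet suits this kind of key collection, the sketch below gathers lookup keys for one token. The helper candidateKeys and its inputs are hypothetical and are not taken from the PR; the past-tense fallback mentioned in the comment is not modelled.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class CandidateKeysSketch {

  // Hypothetical helper: collect lookup keys for one token,
  // surface form first, then each lemma, all lowercased.
  static Set<String> candidateKeys(String token, List<String> lemmas) {
    // LinkedHashSet keeps insertion order and silently drops duplicates
    Set<String> keys = new LinkedHashSet<>();
    keys.add(token.toLowerCase(Locale.ROOT));
    for (String lemma : lemmas) {
      if (lemma != null) {
        keys.add(lemma.toLowerCase(Locale.ROOT));
      }
    }
    return keys;
  }

  public static void main(String[] args) {
    // The repeated "doggie" lemma is kept only once, in first-seen position
    System.out.println(candidateKeys("Doggies", Arrays.asList("doggie", "doggy", "doggie")));
    // prints [doggies, doggie, doggy]
  }
}
```

Keeping the surface form first means a direct hit on the token would be tried before any lemma, which is one plausible ordering; the PR may order its keys differently.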


169-189: LGTM!

The past tense fallback logic is conservative and safe, with proper boundary checks for all string operations. The handling of hyphenated forms is a nice touch.


192-222: LGTM! This is the core immunity logic that prevents false positives.

The inflection detection is appropriately conservative, whitelisting only patterns unlikely to be verb forms (y-stem plurals/comparatives, generic -er/-est). The explicit exclusion of -s/-es/-ed/-ing prevents masking genuine verb form inconsistencies. This correctly addresses the issue described in the PR for words like "doggy"/"doggie" and their inflections.
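
The whitelist described in this comment could look roughly like the following sketch. It is not the PR's code: the method name isWhitelistedInflection, the exact suffix handling, and the sample words are assumptions chosen to match the patterns named above (y-stem plurals/comparatives and generic -er/-est, with -s/-es/-ed/-ing left out).

```java
import java.util.Locale;

class InflectionWhitelistSketch {

  // Hypothetical helper: true only for y-stem -ies/-ier/-iest forms
  // and generic -er/-est forms of the given base word.
  static boolean isWhitelistedInflection(String token, String base) {
    String t = token.toLowerCase(Locale.ROOT);
    String b = base.toLowerCase(Locale.ROOT);
    if (b.length() > 1 && b.endsWith("y")) {
      // "doggy" -> stem "dogg" -> "doggies", "doggier", "doggiest"
      String stem = b.substring(0, b.length() - 1);
      if (t.equals(stem + "ies") || t.equals(stem + "ier") || t.equals(stem + "iest")) {
        return true;
      }
    }
    // -s, -es, -ed and -ing are deliberately not accepted,
    // so verb forms such as "walks" or "walked" are never whitelisted
    return t.equals(b + "er") || t.equals(b + "est");
  }

  public static void main(String[] args) {
    System.out.println(isWhitelistedInflection("doggies", "doggy")); // true
    System.out.println(isWhitelistedInflection("tallest", "tall"));  // true
    System.out.println(isWhitelistedInflection("walked", "walk"));   // false
  }
}
```

A stricter or looser whitelist is possible; the point of the sketch is only that accepting a narrow set of suffixes avoids hiding real verb-form inconsistencies.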


102-102: No action needed: createReplacement is inherited from AbstractWordCoherencyRule and defined with the correct signature.


95-111: Fix: Set short message before adding RuleMatch to the list.

Line 110 calls setShortMessage() after the RuleMatch has already been added to ruleMatches on line 108. This ordering issue means the RuleMatch in the list won't have the short message set at the time it's added.

Apply this diff to fix the ordering:

           }
           if (!marked.equalsIgnoreCase(replacement)) {
             rm.setSuggestedReplacement(replacement);
+            rm.setShortMessage(getShortMessage());
             ruleMatches.add(rm);
           }
-          rm.setShortMessage(getShortMessage());
           break;

Likely an incorrect or invalid review comment.
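
One reason the ordering concern may not apply: a Java list stores a reference to the object, so a setter called after add() is still visible through the list. The sketch below shows this with a hypothetical stand-in class Match; it assumes RuleMatch behaves like an ordinary mutable object and does not use the real LanguageTool class.

```java
import java.util.ArrayList;
import java.util.List;

class ReferenceSemanticsSketch {

  // Hypothetical stand-in for a mutable match object; not the real RuleMatch
  static class Match {
    private String shortMessage;

    void setShortMessage(String msg) {
      this.shortMessage = msg;
    }

    String getShortMessage() {
      return shortMessage;
    }
  }

  static String shortMessageSeenThroughList() {
    List<Match> matches = new ArrayList<>();
    Match m = new Match();
    matches.add(m);                      // the list now holds a reference to m
    m.setShortMessage("Mixed variants"); // mutation happens after add()
    return matches.get(0).getShortMessage();
  }

  public static void main(String[] args) {
    System.out.println(shortMessageSeenThroughList()); // prints "Mixed variants"
  }
}
```

Under that assumption, moving setShortMessage() before add() would change readability but not behavior.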


@minizhiren
Contributor Author

@coderabbitai generate docstrings

@coderabbitai
Contributor

coderabbitai bot commented Oct 15, 2025

✅ Actions performed

Initiated docstring generation; will generate only if new commits exist.

@coderabbitai
Contributor

coderabbitai bot commented Oct 15, 2025

Note

Docstrings generation - SUCCESS
Generated docstrings for this pull request at #11570

coderabbitai bot added a commit that referenced this pull request Oct 15, 2025
Docstrings generation was requested by @minizhiren.

* #11568 (comment)

The following files were modified:

* `languagetool-language-modules/en/src/main/java/org/languagetool/rules/en/WordCoherencyRule.java`