-
Notifications
You must be signed in to change notification settings - Fork 1.5k
extend JLanguageTool to allow customising used rules easier #11655
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
WalkthroughAdded runtime custom-rule support to JLanguageTool (new constructor, setRules, and checkInternalWithCustomRules), introduced RemoteRuleResult.adjustOffsets with related constructors and accessor, and centralized textSessionID computation in TextChecker via a new computeTextSessionID helper. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes
Suggested reviewers
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (3)
languagetool-core/src/main/java/org/languagetool/JLanguageTool.java (3)
320-365: Custom-rules constructor behavior is reasonable but leaks the caller’s list and ignoreswithLanguageModelThe new 10‑arg constructor wires
customRulesstraight intobuiltinRulesand skips all default-rule initialization, which matches the high-level intent. Two caveats:
builtinRules = customRules;means the caller’s list is now the internal mutable list. Later calls likesetRules()(which clearsbuiltinRules) or any other in‑place operations will also mutate the original list reference the caller may still hold. Consider defensively copying: e.g. wrappingcustomRulesin a newArrayListbefore assigning tobuiltinRules.- When
customRules != null, thewithLanguageModelflag and the “optional LM rules” path are effectively ignored; only the caller-supplied rules are used. That’s probably fine, but it might be worth documenting explicitly in the new Javadoc so users don’t expect LM-capable rules to be auto-attached in this mode.If you agree on the first point, a small refactor like this would avoid aliasing:
- ResourceBundle messages = ResourceBundleTools.getMessageBundle(language); - if (customRules != null) { - builtinRules = customRules; - } else { - builtinRules = getAllBuiltinRules(language, messages, userConfig, globalConfig); + ResourceBundle messages = null; + if (customRules != null) { + builtinRules = new ArrayList<>(customRules); + } else { + messages = ResourceBundleTools.getMessageBundle(language); + builtinRules = getAllBuiltinRules(language, messages, userConfig, globalConfig); try { activateDefaultPatternRules(); ...
805-817:setRulesclears built‑ins but leaves enable/disable state and categories intactThe new
setRules(List<Rule> rules)does a hard reset of the internal rule lists:builtinRules.clear(); userRules.clear(); userRules.addAll(rules); ruleSetCache.clear();A few behavioral implications worth double‑checking:
You’re clearing
builtinRulesentirely and placing all new rules intouserRules. Because other code usually treats both lists similarly this likely works, but semantically it means “everything is now user rules”. If the intent is “replace the complete active rule set”, you might instead want the new rules to live inbuiltinRulesand keepuserRulesas the layer for incremental additions:public void setRules(List<Rule> rules) {builtinRules.clear();
userRules.clear();
userRules.addAll(rules);
- builtinRules.clear();
- builtinRules.addAll(rules);
- userRules.clear();
ruleSetCache.clear();
}- The method doesn’t reset `disabledRules`, `enabledRules`, `disabledRuleCategories`, or `enabledRuleCategories`. Those historical flags will still affect the newly set rules if IDs happen to collide, which could be surprising when a caller expects a clean slate. Either clearing those sets here or clarifying in the Javadoc that enable/disable state is preserved would make the behavior less surprising. Given this is new public API, aligning implementation and documentation now will avoid confusing interactions later. --- `1081-1086`: **Refactor to `checkInternalWithCustomRules` looks correct; consider tightening the public API surface** The split between: - `checkInternal(...)` → derives `RuleSet` via `getActiveRulesForLevelAndToneTags` and - `public CheckResults checkInternalWithCustomRules(RuleSet rules, ...)` → runs the full pipeline is clean and appears to preserve the previous behavior: remote rules, caching, GRPC post‑processing, and final `filterMatches` all go through the same path, and tone/level filtering is applied consistently via `isRuleActiveForLevelAndToneTags` in `filterMatches`. Two small API points to consider: - `checkInternalWithCustomRules` is now public but accepts a raw `RuleSet` without any Javadoc. For external callers it’s not obvious that: - They should generally pass a `RuleSet` built similarly to the internal one (e.g. `RuleSet.textLemmaHinted(...)`), and - Level/tone filtering still applies at the match level, regardless of how `rules` was constructed. A short Javadoc block spelling this out (and clarifying that disable/enable flags and tone tags are still honored in `filterMatches`) would help prevent misuse. - The `rules` parameter is expected to be non‑null and compatible with the rest of the pipeline (e.g. `rules.rulesForSentence(...)` must be valid). You might want to annotate it as `@NotNull` to make that contract explicit. Otherwise, the factoring and wiring into the existing check flow look solid. Also applies to: 1088-1176, 1178-1198 </blockquote></details> </blockquote></details> <details> <summary>📜 Review details</summary> **Configuration used**: CodeRabbit UI **Review profile**: CHILL **Plan**: Pro <details> <summary>📥 Commits</summary> Reviewing files that changed from the base of the PR and between 324ca0e32e89aa73d89b18761ebfbb157215222b and 5db7122d7a84a507f389c4e5faa53eee6a348811. </details> <details> <summary>📒 Files selected for processing (1)</summary> * `languagetool-core/src/main/java/org/languagetool/JLanguageTool.java` (4 hunks) </details> <details> <summary>⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)</summary> * GitHub Check: Analyze (java-kotlin) </details> </details> <!-- This is an auto-generated comment by CodeRabbit for review status -->
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Nitpick comments (5)
languagetool-core/src/main/java/org/languagetool/rules/RemoteRuleResult.java (2)
60-62: Consider documenting the defaultadjustOffsets = truebehaviorSince this constructor preserves the old signature, it will likely remain the most common entry point. A short Javadoc explaining that it defaults
adjustOffsetstotruewould make the behavior explicit for callers.- public RemoteRuleResult(boolean remote, boolean success, List<RuleMatch> matches, List<AnalyzedSentence> processedSentences) { + /** + * Creates a {@code RemoteRuleResult} assuming that rule matches are sentence-relative + * and thus require offset adjustment (i.e. {@code adjustOffsets} is {@code true}). + */ + public RemoteRuleResult(boolean remote, boolean success, List<RuleMatch> matches, List<AnalyzedSentence> processedSentences) { this(remote, success, true, matches, processedSentences); }
72-74: Optional: add Javadoc to clarify what “adjustOffsets” means for callersGiven the slightly imperative method name (
adjustOffsets()vs.isXxx()), a brief Javadoc would help make the contract obvious at call sites (especially when readingif (result.adjustOffsets()) { ... }).- public boolean adjustOffsets() { - return adjustOffsets; - } + /** + * @return {@code true} if the {@link RuleMatch} offsets are relative to their + * sentences and still need to be adjusted to the full text; {@code false} + * if the matches already use full-text offsets and must not be adjusted. + */ + public boolean adjustOffsets() { + return adjustOffsets; + }languagetool-server/src/main/java/org/languagetool/server/TextChecker.java (1)
459-459: ChangedtextSessionIdsemantics – confirm downstream expectations
textSessionIdis now always derived viacomputeTextSessionID(params.get("textSessionId"), remoteAddress), i.e., a SHA‑256‑based 64‑bit hash of either the client-supplied string or the IP (and possiblynullif you adopt the defensive change above), instead of parsing a long and falling back on failure.This is a behavioral change that affects any consumers that log, persist, or interpret
textSessionId(e.g., database logs, A/B tests, remote rule infrastructure). Please double-check that:
- No callers rely on being able to pass a specific numeric
textSessionIdvalue (e.g., for correlation across systems).- Schema/range assumptions for stored
textSessionIdvalues are still valid with potentially negative 64‑bit values and different distribution.If needed, you might want a brief migration note or comment to make this intentional change clear.
languagetool-core/src/main/java/org/languagetool/JLanguageTool.java (2)
805-817:setRulesshould also reset language-model rule bookkeeping; consider clarifying interaction with enable/disable sets
setRulescurrently clearsbuiltinRulesanduserRules, then installs the new rules intouserRulesand clears theruleSetCache. However:
optionalLanguageModelRulesis left untouched. IfupdateOptionalLanguageModelRules(...)is called aftersetRules, it will remove rules from the newuserRuleswhose IDs happen to match stale entries from the previous configuration.disabledRules,enabledRules,disabledRuleCategories, andenabledRuleCategoriesare also preserved, which may or may not be desired when doing a full ruleset replacement (rules with reused IDs could be unexpectedly filtered).At minimum, I’d strongly recommend clearing the LM bookkeeping to avoid subtle removal of unrelated rules:
public void setRules(List<Rule> rules) { builtinRules.clear(); userRules.clear(); userRules.addAll(rules); + optionalLanguageModelRules.clear(); ruleSetCache.clear(); }If the intent of
setRulesis to fully replace the ruleset, you may also want to document (or reconsider) how the enable/disable sets interact with this API.
1081-1176: NewcheckInternalWithCustomRulesAPI looks good; consider tightening nullability and adding JavadocThe refactor to have
checkInternal(...)compute the activeRuleSetand delegate tocheckInternalWithCustomRules(...)is clean and appears behavior‑preserving. It also provides a useful hook for callers that want to supply their ownRuleSet.To make the new public API safer and more self‑documenting:
- Mark
rulesas non‑null (consistent with other parameters already using@NotNull).- Add a short Javadoc comment that explains when
checkInternalWithCustomRulesshould be used and how it differs fromcheck(...)/check2(...).Example tweak:
- public CheckResults checkInternalWithCustomRules(RuleSet rules, AnnotatedText annotatedText, ParagraphHandling paraMode, RuleMatchListener listener, + public CheckResults checkInternalWithCustomRules(@NotNull RuleSet rules, AnnotatedText annotatedText, ParagraphHandling paraMode, RuleMatchListener listener, Mode mode, Level level, @NotNull Set<ToneTag> toneTags, @Nullable Long textSessionID, List<String> sentences, List<AnalyzedSentence> analyzedSentences) throws IOException {This keeps the surface consistent with the rest of the class and helps avoid accidental
nullpassing from external callers.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
languagetool-core/src/main/java/org/languagetool/JLanguageTool.java(6 hunks)languagetool-core/src/main/java/org/languagetool/rules/RemoteRuleResult.java(3 hunks)languagetool-server/src/main/java/org/languagetool/server/TextChecker.java(3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Analyze (java-kotlin)
🔇 Additional comments (4)
languagetool-core/src/main/java/org/languagetool/rules/RemoteRuleResult.java (2)
32-33:adjustOffsetsflag and comment are consistent with existing responsibilitiesThe new
adjustOffsetsfield cleanly captures whether match offsets are sentence-relative vs. already aligned, and the inline comment is clear enough for future maintainers. No issues from my side here.
41-45: Constructor extension looks correct and preserves internal invariantsWiring the new
adjustOffsetsparameter straight into the field while leaving the rest of the initialization untouched keeps behavior predictable. Combined with the delegating ctor below, this setup should preserve existing behavior while enabling the new mode.languagetool-server/src/main/java/org/languagetool/server/TextChecker.java (1)
47-53: New imports are appropriate for hashing helperThe added imports (ByteBuffer, StandardCharsets, MessageDigest, NoSuchAlgorithmException) match the new computeTextSessionID implementation and are scoped reasonably; no issues here.
languagetool-core/src/main/java/org/languagetool/JLanguageTool.java (1)
1354-1372: Offset adjustment and caching now correctly respectRemoteRuleResult.adjustOffsets()The changes here (conditioning both cache writes and subsequent
adjustOffset(...)calls onresult.adjustOffsets()) look correct and avoid double‑adjusting matches for remote rules that already return offsets in the final coordinate space. They also prevent such matches from entering the remote‑match cache, which would otherwise be applied withadjustOffset(...)unconditionally on cache reads.This is a good trade‑off: correctness is preserved for both legacy and new remote rules, and caching is still used where offset adjustment is safe.
languagetool-server/src/main/java/org/languagetool/server/TextChecker.java
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
languagetool-core/src/main/java/org/languagetool/JLanguageTool.java (1)
297-300: Constructor ignoresinputLoggingargument and always enables loggingIn the 7-arg constructor
public JLanguageTool(Language language, List<Language> altLanguages, Language motherTongue, ResultCache cache, GlobalConfig globalConfig, UserConfig userConfig, boolean inputLogging)the
inputLoggingparameter is ignored:this(language, altLanguages, motherTongue, cache, globalConfig, userConfig, true, false);This hard-codes
inputLogging = true, so callers cannot disable input logging anymore (and the signature/Javadoc become misleading). It also conflates the default forwithLanguageModelwith the logging flag.Suggested fix:
public JLanguageTool(Language language, List<Language> altLanguages, Language motherTongue, ResultCache cache, GlobalConfig globalConfig, UserConfig userConfig, boolean inputLogging) { - this(language, altLanguages, motherTongue, cache, globalConfig, userConfig, true, false); + this(language, altLanguages, motherTongue, cache, globalConfig, userConfig, inputLogging, false); }This restores the intended behavior: callers control logging via
inputLogging, whilewithLanguageModelstill defaults tofalse.Also applies to: 320-322
♻️ Duplicate comments (2)
languagetool-server/src/main/java/org/languagetool/server/TextChecker.java (1)
47-53: computeTextSessionID: good centralization; consider treating empty param as “unset” and documenting nullabilityThe helper correctly centralizes textSessionId derivation, avoids the previous parsing/NumberFormatException mess, and now:
- Handles the “both null” case via early
return null.- Has removed the uncontrolled
System.outlogging of IDs/IPs.One remaining behavioral edge case: an empty
textSessionIdParamcurrently overridesipand hashes"", giving all such requests the same session ID. Previously, an empty/invalid value would effectively fall back to IP. If you still want IP-based separation in that case, consider:- protected static Long computeTextSessionID(String textSessionIdParam, String ip) { - String input = textSessionIdParam != null ? textSessionIdParam : ip; - if (input == null) { - return null; - } + @Nullable + protected static Long computeTextSessionID(String textSessionIdParam, String ip) { + String param = (textSessionIdParam != null && !textSessionIdParam.isEmpty()) + ? textSessionIdParam + : null; + String input = param != null ? param : ip; + if (input == null || input.isEmpty()) { + // No stable basis for a session id + return null; + } try { MessageDigest md = MessageDigest.getInstance("SHA-256"); byte[] bytes = md.digest(input.getBytes(StandardCharsets.UTF_8)); ByteBuffer buffer = ByteBuffer.wrap(bytes); - Long textSessionId = buffer.getLong(); - return textSessionId; + return buffer.getLong(); } catch (NoSuchAlgorithmException e) { // Should not happen for SHA-256, wrap in a runtime exception throw new RuntimeException("SHA-256 not supported", e); } }This preserves your new hashing approach, avoids NPE/PII issues, and restores more sensible behavior when the client sends an empty
textSessionId.Also applies to: 269-288, 461-461
languagetool-core/src/main/java/org/languagetool/JLanguageTool.java (1)
342-365: Custom-rules constructor and defensive copy look correctThe new 9-arg constructor cleanly wires
customRulesinto initialization and, importantly, wraps them innew ArrayList<>(customRules), so internal mutation ofbuiltinRuleswon’t affect the caller’s list or blow up on unmodifiable lists. The fallback path forcustomRules == nullpreserves previous behavior (built-in rules, default activations, optional language model rules setup).This API surface for custom rule initialization looks sound.
🧹 Nitpick comments (1)
languagetool-core/src/main/java/org/languagetool/JLanguageTool.java (1)
805-817: ClearoptionalLanguageModelRuleswhen replacing rules viasetRules
setRulescorrectly clearsbuiltinRules,userRules, andruleSetCachebefore installing the new rule list. However,optionalLanguageModelRulesis left untouched:public void setRules(List<Rule> rules) { builtinRules.clear(); userRules.clear(); userRules.addAll(rules); ruleSetCache.clear(); }If the instance previously had language-model-capable rules loaded,
optionalLanguageModelRulescan still contain their IDs. A later call toupdateOptionalLanguageModelRules(...)will then remove any newly set rules with those IDs and replace them with LM-specific variants, which is surprising whensetRuleswas expected to take full control of the active set.Consider explicitly resetting this internal LM bookkeeping:
public void setRules(List<Rule> rules) { builtinRules.clear(); userRules.clear(); + optionalLanguageModelRules.clear(); userRules.addAll(rules); ruleSetCache.clear(); }This keeps the LM-related state consistent with the new rule set and avoids hard-to-debug interactions for callers using custom rules plus language models.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
languagetool-core/src/main/java/org/languagetool/JLanguageTool.java(6 hunks)languagetool-server/src/main/java/org/languagetool/server/TextChecker.java(3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Analyze (java-kotlin)
🔇 Additional comments (1)
languagetool-core/src/main/java/org/languagetool/JLanguageTool.java (1)
1082-1090: checkInternalWithCustomRules +adjustOffsetsgating for remote rules looks correctThe refactor to:
RuleSet rules = getActiveRulesForLevelAndToneTags(level, toneTags); return checkInternalWithCustomRules(rules, ...);plus the new
public CheckResults checkInternalWithCustomRules(...)cleanly generalizes the internal check pipeline to arbitraryRuleSets without duplicating logic, and maintains existing behavior for the standard path.In
fetchResults, the newadjustOffsetsusage:if (cache != null && result.isSuccess() && result.adjustOffsets()) { // store in cache ... cacheEntry.put(ruleKey, matches); } ... List<RuleMatch> adjustedMatches = matches.stream().map(RuleMatch::new).collect(Collectors.toList()); if (result.adjustOffsets()) { for (RuleMatch match : adjustedMatches) { adjustOffset(annotatedText, offset, match); } }ensures that:
- Only results whose offsets are sentence-relative are cached under
InputSentencekeys and later re-offset.- Results that already carry document-relative offsets are neither cached under the sentence key nor offset a second time.
This matches the new
RemoteRuleResult.adjustOffsetscontract and avoids double-adjusting or mis-caching remote matches.Also applies to: 1354-1372
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
languagetool-core/src/main/java/org/languagetool/rules/RemoteRule.java (1)
193-225: ConfirmadjustOffsetssemantics when re‑wrappingRemoteRuleResultYou correctly preserve
isRemote/isSuccessand theadjustOffsets()flag when creating newRemoteRuleResultinstances afterfilterMatchesand aftersuppressMisspelled, so any downstream handling based on that flag is not silently lost.One thing to double‑check: above this block,
fixMatchOffsets(...)may already have rewritten the offsets in place. IfadjustOffsetsis interpreted as “consumer still needs to adjust offsets”, you might now be propagating a flag whose precondition has been fulfilled locally, which could lead to double‑adjustment elsewhere. If, instead, the flag means “offsets have been adjusted already” (or is just metadata about the remote), the current propagation is fine. It’s worth verifying that the boolean’s semantics and thefixOffsetsbehavior are aligned.As a minor, optional readability tweak, you could avoid double‑wrapping by performing both filtering steps into a single final
List<RuleMatch>and constructingRemoteRuleResultonly once at the end, and/or using more specific names than a secondfilteredMatchesfor the post‑suppression list.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
languagetool-core/src/main/java/org/languagetool/rules/RemoteRule.java(2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Analyze (java-kotlin)
Summary by CodeRabbit
New Features
Improvements
Documentation
✏️ Tip: You can customize this high-level summary in your review settings.