Skip to content

Conversation

@DMwangnima
Copy link
Contributor

@DMwangnima DMwangnima commented Oct 23, 2025

What type of PR is this?

fix

Check the PR title.

  • This PR title match the format: <type>(optional scope): <description>
  • The description of this PR title is user-oriented and clear enough for others to understand.
  • Attach the PR updating the user documentation if the current PR requires user awareness at the usage level. User docs repo

(Optional) Translate the PR title into Chinese.

fix(retry): 混合重试在第一次调用成功后依然会执行 backoff 逻辑

(Optional) More detailed description for this PR(en: English/zh: Chinese).

en:
zh(optional):
bug 引入版本:Kitex v0.11.0,支持混合重试时该 bug 便存在。

(Optional) Which issue(s) this PR fixes:

(optional) The PR that updates user documentation:

@DMwangnima DMwangnima requested review from a team as code owners October 23, 2025 11:04
@DMwangnima DMwangnima marked this pull request as draft October 23, 2025 11:05
@codecov
Copy link

codecov bot commented Oct 23, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 59.09%. Comparing base (bfa27f3) to head (bf6425b).
⚠️ Report is 3 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1879      +/-   ##
==========================================
- Coverage   61.37%   59.09%   -2.29%     
==========================================
  Files         388      388              
  Lines       35063    34458     -605     
==========================================
- Hits        21521    20364    -1157     
- Misses      12247    12808     +561     
+ Partials     1295     1286       -9     
Flag Coverage Δ
integration 50.56% <57.14%> (+0.06%) ⬆️
unit 51.67% <100.00%> (+0.03%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@DMwangnima DMwangnima force-pushed the fix/mixed_retry_backoff branch from 4efbb3d to 609ac2e Compare October 23, 2025 11:06
@DMwangnima DMwangnima marked this pull request as ready for review October 23, 2025 11:22
@claude
Copy link

claude bot commented Oct 23, 2025

Pull Request Review - #1879

Summary

This PR fixes a bug in the mixed retry logic where backoff delays were being executed even when the first call succeeded. The fix reorders condition checks to first verify if a result should trigger a retry before checking if a retry is allowed.


Code Quality & Best Practices ✅

Positive aspects:

  • Clear bug fix: The logic reordering is straightforward and directly addresses the issue
  • Minimal change: The fix is surgical - only swapping the order of two condition checks
  • Preserves functionality: The nested conditions maintain the same error message handling logic

Suggestions:

  • The indentation and code structure remain clean and consistent with the existing codebase

Bug Analysis 🐛

Original Issue:
The bug occurred in pkg/retry/mixed_retryer.go:161-170. The original code checked ShouldRetry (which validates retry limits, circuit breakers, etc.) before checking isRetryResult (which determines if the result warrants a retry).

Root Cause:

// OLD (buggy) - lines 161-170
if msg, ok := r.ShouldRetry(ctx, nil, callCount, req, cbKey); ok {
    if r.isRetryResult(ctx, res.ri, res.resp, res.err, &r.policy.FailurePolicy) {
        // retry logic
    }
}

This caused the system to:

  1. Check if retry is allowed (passes)
  2. Then check if result needs retry (fails for successful call)
  3. But still wait for backoff before returning

The Fix:

// NEW (fixed) - lines 161-170
if r.isRetryResult(ctx, res.ri, res.resp, res.err, &r.policy.FailurePolicy) {
    if msg, ok := r.ShouldRetry(ctx, nil, callCount, req, cbKey); ok {
        // retry logic
    }
}

Now the system:

  1. First checks if the result needs retry (if successful, skip retry entirely)
  2. Only then checks if retry is allowed
  3. Returns immediately without unnecessary backoff delays

Impact: This prevents unnecessary latency when calls succeed on the first attempt.


Performance Considerations ⚡

Positive Impact:

  • Reduced latency: Successful first calls now return ~100ms faster (eliminating the backoff wait)
  • Early exit optimization: The reordered checks allow for earlier exits when results don't need retry
  • No performance regression: The change doesn't add any new operations or allocations

Test Evidence:
The new test TestMockCase5WithDiffRetry demonstrates the fix:

  • First call succeeds in ~100ms (down from ~200ms with the bug)
  • Test case validates timing expectations with 50ms tolerance

Security Concerns 🔒

No security issues identified:

  • The change is purely logical reordering with no security implications
  • Circuit breaker and rate limiting checks (ShouldRetry) are still performed appropriately
  • No new attack vectors introduced
  • Error message handling remains intact

Test Coverage ✅

Excellent test coverage:

The PR adds comprehensive test coverage via TestMockCase5WithDiffRetry with 4 test scenarios:

  1. First call succeeds: Validates no unnecessary backoff (100ms vs 200ms)
  2. Second call succeeds: Confirms proper retry with backoff (350ms)
  3. Third call succeeds: Tests max retry times with success (600ms)
  4. All calls fail: Validates failure path with retries (650ms)

Test Quality:

  • Uses realistic timing with RPC timeouts (150ms) and backoff delays (100ms)
  • Validates both success and failure paths
  • Checks call counts, error types, and timing constraints
  • Employs 50ms tolerance for timing assertions (appropriate for CI environments)

Minor note:

  • Line 240 has a test case labeled as "case5" in comment but "case4" would be more accurate (not blocking)

Recommendations

This PR is ready to merge with the following minor suggestions:

  1. Optional: Fix test comment - Line 240 in mixed_test.go:

    // case4: all error retry and trigger circuit breaker
    t.Run("case5", func(t *testing.T) {

    Consider aligning the comment and test name for consistency.

  2. Consider adding a comment in mixed_retryer.go:161 explaining the condition order:

    // Check if result needs retry first to avoid unnecessary backoff on success
    if r.isRetryResult(ctx, res.ri, res.resp, res.err, &r.policy.FailurePolicy) {

Conclusion

Verdict: LGTM ✅ (Looks Good To Me)

This is a high-quality bug fix that:

  • Correctly identifies and resolves the performance issue
  • Includes comprehensive test coverage
  • Has no security concerns or performance regressions
  • Follows the existing code patterns and style

The fix is minimal, well-tested, and improves user experience by eliminating unnecessary latency on successful calls.

Approval recommended.


Review performed by Claude Code

@DMwangnima DMwangnima force-pushed the fix/mixed_retry_backoff branch from 609ac2e to bf6425b Compare October 23, 2025 11:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

3 participants