@Sameerlite Sameerlite commented Oct 30, 2025

Title

Add Prometheus metric to track callback logging failures

Relevant issues

Adds monitoring for callback health - tracks when S3, Langfuse, and other callbacks fail to log events.

Pre-Submission checklist

  • I have added testing in the tests/litellm/ directory
  • I have added a screenshot of my new test passing locally
  • My PR passes all unit tests on make test-unit
  • My PR's scope is as isolated as possible, it only solves 1 specific problem

Type

🆕 New Feature

Changes

New Prometheus Metric

Added litellm_callback_logging_failures_metric to track when callbacks (S3, Langfuse, etc.) fail to log events.

Metric:

  • Name: litellm_callback_logging_failures_metric
  • Type: Counter
  • Label: callback_name (e.g., "S3Logger", "LangFuseLogger")

Example:

litellm_callback_logging_failures_metric_total{callback_name="S3Logger"} 5.0
litellm_callback_logging_failures_metric_total{callback_name="LangFuseLogger"} 2.0
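
To make the shape of these samples concrete, here is a minimal stdlib-only sketch of a counter with one label. LabeledCounter and its methods are hypothetical stand-ins for illustration, not the Prometheus client API and not the code in this PR. The only thing taken from the PR is the metric name and the callback_name label; the "_total" suffix matches the example output above.

```python
from collections import defaultdict


class LabeledCounter:
    """Hypothetical stand-in for a Prometheus counter with a single label."""

    def __init__(self, name: str, label: str) -> None:
        self.name = name
        self.label = label
        self._values = defaultdict(float)  # label value -> running count

    def inc(self, label_value: str, amount: float = 1.0) -> None:
        # Counters are monotonic: they only ever go up.
        if amount < 0:
            raise ValueError("counters can only increase")
        self._values[label_value] += amount

    def render(self) -> list:
        # Counter samples are exposed under "<name>_total", one line per label value.
        return [
            f'{self.name}_total{{{self.label}="{value}"}} {count}'
            for value, count in sorted(self._values.items())
        ]


failures = LabeledCounter("litellm_callback_logging_failures_metric", "callback_name")
for _ in range(5):
    failures.inc("S3Logger")
for _ in range(2):
    failures.inc("LangFuseLogger")

lines = failures.render()
print("\n".join(lines))
```

Each distinct callback_name value gets its own time series, so failures can be graphed or alerted on per callback.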

Files Modified

  1. enterprise/litellm_enterprise/integrations/prometheus.py

    • Added metric definition (lines 302-306)
    • Added increment_callback_logging_failure() method (lines 1733-1750)
  2. litellm/integrations/custom_logger.py

    • Added handle_callback_failure() method that all callbacks can use (lines 571-624)
  3. litellm/integrations/s3_v2.py

    • Modified exception handlers to call handle_callback_failure() on upload failures
    • Tracks failures in async and sync upload methods
    • These changes were made here because the upload methods do not raise the error, so it makes sense to call handle_callback_failure() directly at the point of failure
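
The pattern described in this list can be sketched roughly as follows. This is an illustration under assumptions, not the PR's implementation: BaseLogger, UploadLogger, the on_failure hook, and _send are hypothetical names, and only handle_callback_failure(callback_name=...) mirrors a name from the PR. The idea is that a shared base-class method reports the failure, and an integration that swallows its own exceptions calls it from the except block.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger("callback_sketch")


class BaseLogger:
    """Hypothetical base class standing in for the shared custom-logger base."""

    def __init__(self, on_failure: Optional[Callable[[str], None]] = None) -> None:
        # on_failure is an assumed hook, e.g. something that increments a counter.
        self._on_failure = on_failure

    def handle_callback_failure(self, callback_name: str) -> None:
        if self._on_failure is None:
            return
        try:
            self._on_failure(callback_name)
        except Exception:
            # Reporting a failure must never raise into the caller.
            logger.exception("could not record callback failure")


class UploadLogger(BaseLogger):
    """Hypothetical integration that swallows its own upload errors."""

    def upload(self, payload: dict) -> None:
        try:
            self._send(payload)
        except Exception as e:
            logger.error("upload error - %s", e)
            # The error is not re-raised, so it is reported here instead.
            self.handle_callback_failure(callback_name="S3Logger")

    def _send(self, payload: dict) -> None:
        raise ConnectionError("simulated upload failure")


recorded = []
s3_like = UploadLogger(on_failure=recorded.append)
s3_like.upload({"event": "example"})
print(recorded)
```

Because upload() never re-raises, the caller sees no exception; the only trace of the failure is the call into the hook.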

except Exception as e:
    verbose_logger.exception(f"s3 Layer Error - {str(e)}")
    pass
    self.handle_callback_failure(callback_name="S3Logger")

Which would be less work over time:

  • requiring each instance to implement this, or
  • having integrations just bubble the error up and letting litellm_logging handle it?

@Sameerlite

@Sameerlite Sameerlite Oct 31, 2025

@krrishdholakia The second option is less work, but the problem is that periodic_flush and the methods it calls don't raise errors or propagate them to litellm_logging. There are also fire-and-forget tasks, and I wasn't able to find a way to bubble their errors up. The approach I used makes sure that if an error occurs, it is recorded in Prometheus.

Base automatically changed from litellm_container_proxy_integration to litellm_sameer_oct_staging_2 October 31, 2025 03:02
Add proxy support to container apis & logging support (#16049)