-
-
Notifications
You must be signed in to change notification settings - Fork 899
Feature/bedrock document support #1936
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Feature/bedrock document support #1936
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Important
Looks good to me! 👍
Reviewed everything up to 8b613ad in 2 minutes and 33 seconds. Click for details.
- Reviewed
353lines of code in5files - Skipped
0files when reviewing. - Skipped posting
3draft comments. View those below. - Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. instructor/processing/multimodal.py:844
- Draft comment:
Consider adding a timeout for the HTTP GET request (e.g. requests.get(self.source, timeout=10)) to prevent hanging on slow or unresponsive servers. - Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 30% vs. threshold = 85% The comment is suggesting a best practice (adding timeouts to HTTP requests) which is generally good advice. However, looking at the rules, I need to check: 1) Is this clearly a code change that's required? The code will work without it. 2) Is this consistent with the rest of the codebase? I see mixed usage - some methods have timeouts, others don't. 3) Is this speculative or definite? The comment says "to prevent hanging" which is a potential issue, not a definite one. The comment seems to be suggesting a code quality improvement rather than pointing out a clear bug. While it's good advice, the codebase is inconsistent about this pattern, and the comment is somewhat speculative about the hanging issue. The comment could be considered valid because it's an actionable code quality improvement and timeouts are a best practice. The fact that some other methods in the same file do use timeouts (likefrom_gs_url) suggests this might be an expected pattern. This isn't purely speculative - missing timeouts can cause real production issues. While timeouts are best practice, the codebase shows inconsistent usage of this pattern. Many otherrequests.get()calls in the file don't have timeouts (lines 227, 302, 385, 691, 750, 776), so this isn't clearly an established pattern that must be followed. The comment is suggesting a potential improvement rather than pointing out a clear bug or inconsistency with established patterns in this specific file. This comment should be deleted. While adding timeouts is good practice, the codebase doesn't consistently apply this pattern, and the comment is more of a speculative suggestion about potential issues rather than identifying a clear bug or required change. The rules state not to make speculative comments and to only comment when there's clearly a code change required.
2. instructor/processing/multimodal.py:812
- Draft comment:
The filename sanitization regex removes periods (.) which eliminates file extensions. Confirm if this is intentional; preserving the extension might be useful. - Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 20% vs. threshold = 85% This is a code quality/potential bug comment about the sanitization logic. The comment identifies a specific issue: the regex removes periods, which eliminates file extensions. This seems like a legitimate concern since the name is often derived from actual filenames that would have extensions. The comment provides a concrete suggestion (adding "." to the regex pattern). However, the comment uses "Confirm if this is intentional" language, which violates the rule about not asking the PR author to confirm intentions. That said, the comment does provide a clear code suggestion that would fix the issue. The question is whether this is actually a problem - maybe Bedrock's API doesn't want or need file extensions in the name field? Without seeing Bedrock's documentation, it's hard to know if this is truly an issue. The comment asks the author to "confirm if this is intentional" which violates the rule against asking for confirmation. Additionally, without access to Bedrock's API documentation, I cannot verify whether file extensions should be preserved in the name field - this might be intentional sanitization per Bedrock's requirements. While the comment does ask for confirmation, it also provides a concrete, actionable code suggestion. The core observation (that periods are removed, eliminating extensions) is factually correct and the suggestion is clear. However, without knowing Bedrock's actual requirements, I cannot determine if this is truly a bug or intentional behavior. The comment should be deleted because it asks the author to confirm intentions ("Confirm if this is intentional"), which violates the review rules. Additionally, without evidence that preserving file extensions is required by Bedrock's API, this may be speculative.
3. instructor/processing/multimodal.py:857
- Draft comment:
Consider wrapping the base64 decoding (pdf_bytes = base64.b64decode(self.data)) in a try/except block to provide a clearer error message if decoding fails. - Reason this comment was not posted:
Confidence changes required:80%<= threshold85%None
Workflow ID: wflow_58jfiFM3aBcM8iVn
You can customize by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.
|
@cursoragent please udpate this PR to include documentation updates and review this |
Any news on this? I'm afraid the job wasnt triggered. |
Summary
This PR adds support for PDF documents in AWS Bedrock. I've introduced a
to_bedrock()method on thePDFclass and extended Bedrock's native content passthrough to handle document blocks.Mentioned in #1924
Changes
1. PDF to Bedrock Conversion Method (
instructor/processing/multimodal.py)Added
PDF.to_bedrock()method that converts PDF objects into Bedrock's document format with support for multiple source types:s3://bucket/keypaths to Bedrock'ss3LocationformatbytesfieldKey features:
2. Bedrock Native Document Passthrough (
instructor/providers/bedrock/utils.py)Extended
_to_bedrock_content_items()to recognize and pass through Bedrock-native document blocks:{"document": {"format": "pdf", "name": "...", "source": {"bytes": <raw bytes>}}}This enables users to provide pre-formatted document blocks directly without conversion.
Example Usage
Important
Adds
PDF.to_bedrock()method for AWS Bedrock support, handling various PDF sources and extending document passthrough in Bedrock utilities.PDF.to_bedrock()inmultimodal.pyto convert PDFs to Bedrock format, supporting S3 URIs, local files, HTTP(S) URLs, and base64 data._to_bedrock_content_items()inutils.pyto pass through Bedrock-native document blocks.PDF.to_bedrock()with various sources intest_multimodal.py.test_bedrock_native_document_passthrough()intest_bedrock_native_passthrough.py.This description was created by
for 8b613ad. You can customize this summary. It will automatically update as commits are pushed.