Skip to content

Conversation

@lucagobbi
Copy link

@lucagobbi lucagobbi commented Nov 20, 2025

Summary

This PR adds support for PDF documents in AWS Bedrock. I've introduced a to_bedrock() method on the PDF class and extended Bedrock's native content passthrough to handle document blocks.

Mentioned in #1924

Changes

1. PDF to Bedrock Conversion Method (instructor/processing/multimodal.py)

Added PDF.to_bedrock() method that converts PDF objects into Bedrock's document format with support for multiple source types:

  • S3 URIs: Converts s3://bucket/key paths to Bedrock's s3Location format
  • Local files: Reads file bytes and includes them in the bytes field
  • HTTP(S) URLs: Fetches and converts remote PDFs to bytes
  • Base64 data: Decodes existing base64 data to raw bytes

Key features:

  • Automatic name sanitization per Bedrock requirements (alphanumeric, whitespace, hyphens, parentheses, brackets only)
  • Smart name inference from file paths, URLs, or custom override
  • Proper error handling for invalid S3 URIs and missing data

2. Bedrock Native Document Passthrough (instructor/providers/bedrock/utils.py)

Extended _to_bedrock_content_items() to recognize and pass through Bedrock-native document blocks:

{"document": {"format": "pdf", "name": "...", "source": {"bytes": <raw bytes>}}}

This enables users to provide pre-formatted document blocks directly without conversion.

Example Usage

# S3 URI (no data transfer needed)
pdf = PDF(source="s3://my-bucket/document.pdf")
bedrock_content = pdf.to_bedrock()

# Local file
pdf = PDF.from_path("report.pdf")
bedrock_content = pdf.to_bedrock(name="Q4 Report")

# Direct Bedrock-native format
messages = [{
    "role": "user",
    "content": [
        {"text": "Summarize this document"},
        {"document": {"format": "pdf", "name": "doc", "source": {"bytes": pdf_bytes}}}
    ]
}]

Important

Adds PDF.to_bedrock() method for AWS Bedrock support, handling various PDF sources and extending document passthrough in Bedrock utilities.

  • Behavior:
    • Adds PDF.to_bedrock() in multimodal.py to convert PDFs to Bedrock format, supporting S3 URIs, local files, HTTP(S) URLs, and base64 data.
    • Handles name sanitization and error handling for invalid S3 URIs and missing data.
  • Utilities:
    • Extends _to_bedrock_content_items() in utils.py to pass through Bedrock-native document blocks.
  • Tests:
    • Adds tests for PDF.to_bedrock() with various sources in test_multimodal.py.
    • Adds test_bedrock_native_document_passthrough() in test_bedrock_native_passthrough.py.

This description was created by Ellipsis for 8b613ad. You can customize this summary. It will automatically update as commits are pushed.

Copy link
Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important

Looks good to me! 👍

Reviewed everything up to 8b613ad in 2 minutes and 33 seconds. Click for details.
  • Reviewed 353 lines of code in 5 files
  • Skipped 0 files when reviewing.
  • Skipped posting 3 draft comments. View those below.
  • Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. instructor/processing/multimodal.py:844
  • Draft comment:
    Consider adding a timeout for the HTTP GET request (e.g. requests.get(self.source, timeout=10)) to prevent hanging on slow or unresponsive servers.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 30% vs. threshold = 85% The comment is suggesting a best practice (adding timeouts to HTTP requests) which is generally good advice. However, looking at the rules, I need to check: 1) Is this clearly a code change that's required? The code will work without it. 2) Is this consistent with the rest of the codebase? I see mixed usage - some methods have timeouts, others don't. 3) Is this speculative or definite? The comment says "to prevent hanging" which is a potential issue, not a definite one. The comment seems to be suggesting a code quality improvement rather than pointing out a clear bug. While it's good advice, the codebase is inconsistent about this pattern, and the comment is somewhat speculative about the hanging issue. The comment could be considered valid because it's an actionable code quality improvement and timeouts are a best practice. The fact that some other methods in the same file do use timeouts (like from_gs_url) suggests this might be an expected pattern. This isn't purely speculative - missing timeouts can cause real production issues. While timeouts are best practice, the codebase shows inconsistent usage of this pattern. Many other requests.get() calls in the file don't have timeouts (lines 227, 302, 385, 691, 750, 776), so this isn't clearly an established pattern that must be followed. The comment is suggesting a potential improvement rather than pointing out a clear bug or inconsistency with established patterns in this specific file. This comment should be deleted. While adding timeouts is good practice, the codebase doesn't consistently apply this pattern, and the comment is more of a speculative suggestion about potential issues rather than identifying a clear bug or required change. The rules state not to make speculative comments and to only comment when there's clearly a code change required.
2. instructor/processing/multimodal.py:812
  • Draft comment:
    The filename sanitization regex removes periods (.) which eliminates file extensions. Confirm if this is intentional; preserving the extension might be useful.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 20% vs. threshold = 85% This is a code quality/potential bug comment about the sanitization logic. The comment identifies a specific issue: the regex removes periods, which eliminates file extensions. This seems like a legitimate concern since the name is often derived from actual filenames that would have extensions. The comment provides a concrete suggestion (adding "." to the regex pattern). However, the comment uses "Confirm if this is intentional" language, which violates the rule about not asking the PR author to confirm intentions. That said, the comment does provide a clear code suggestion that would fix the issue. The question is whether this is actually a problem - maybe Bedrock's API doesn't want or need file extensions in the name field? Without seeing Bedrock's documentation, it's hard to know if this is truly an issue. The comment asks the author to "confirm if this is intentional" which violates the rule against asking for confirmation. Additionally, without access to Bedrock's API documentation, I cannot verify whether file extensions should be preserved in the name field - this might be intentional sanitization per Bedrock's requirements. While the comment does ask for confirmation, it also provides a concrete, actionable code suggestion. The core observation (that periods are removed, eliminating extensions) is factually correct and the suggestion is clear. However, without knowing Bedrock's actual requirements, I cannot determine if this is truly a bug or intentional behavior. The comment should be deleted because it asks the author to confirm intentions ("Confirm if this is intentional"), which violates the review rules. Additionally, without evidence that preserving file extensions is required by Bedrock's API, this may be speculative.
3. instructor/processing/multimodal.py:857
  • Draft comment:
    Consider wrapping the base64 decoding (pdf_bytes = base64.b64decode(self.data)) in a try/except block to provide a clearer error message if decoding fails.
  • Reason this comment was not posted:
    Confidence changes required: 80% <= threshold 85% None

Workflow ID: wflow_58jfiFM3aBcM8iVn

You can customize Ellipsis by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.

@jxnl
Copy link
Collaborator

jxnl commented Nov 20, 2025

@cursoragent please udpate this PR to include documentation updates and review this

@lucagobbi
Copy link
Author

@cursoragent please udpate this PR to include documentation updates and review this

Any news on this? I'm afraid the job wasnt triggered.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants