Feature/bedrock document support #1936

lucagobbi · 2025-11-20T09:48:14Z

Summary

This PR adds support for PDF documents in AWS Bedrock. I've introduced a to_bedrock() method on the PDF class and extended Bedrock's native content passthrough to handle document blocks.

Mentioned in #1924

Changes

1. PDF to Bedrock Conversion Method (`instructor/processing/multimodal.py`)

Added PDF.to_bedrock() method that converts PDF objects into Bedrock's document format with support for multiple source types:

S3 URIs: Converts s3://bucket/key paths to Bedrock's s3Location format
Local files: Reads file bytes and includes them in the bytes field
HTTP(S) URLs: Fetches and converts remote PDFs to bytes
Base64 data: Decodes existing base64 data to raw bytes

Key features:

Automatic name sanitization per Bedrock requirements (alphanumeric, whitespace, hyphens, parentheses, brackets only)
Smart name inference from file paths, URLs, or custom override
Proper error handling for invalid S3 URIs and missing data

2. Bedrock Native Document Passthrough (`instructor/providers/bedrock/utils.py`)

Extended _to_bedrock_content_items() to recognize and pass through Bedrock-native document blocks:

{"document": {"format": "pdf", "name": "...", "source": {"bytes": <raw bytes>}}}

This enables users to provide pre-formatted document blocks directly without conversion.

Example Usage

# S3 URI (no data transfer needed)
pdf = PDF(source="s3://my-bucket/document.pdf")
bedrock_content = pdf.to_bedrock()

# Local file
pdf = PDF.from_path("report.pdf")
bedrock_content = pdf.to_bedrock(name="Q4 Report")

# Direct Bedrock-native format
messages = [{
    "role": "user",
    "content": [
        {"text": "Summarize this document"},
        {"document": {"format": "pdf", "name": "doc", "source": {"bytes": pdf_bytes}}}
    ]
}]

Important

Adds PDF.to_bedrock() method for AWS Bedrock support, handling various PDF sources and extending document passthrough in Bedrock utilities.

Behavior:
- Adds PDF.to_bedrock() in multimodal.py to convert PDFs to Bedrock format, supporting S3 URIs, local files, HTTP(S) URLs, and base64 data.
- Handles name sanitization and error handling for invalid S3 URIs and missing data.
Utilities:
- Extends _to_bedrock_content_items() in utils.py to pass through Bedrock-native document blocks.
Tests:
- Adds tests for PDF.to_bedrock() with various sources in test_multimodal.py.
- Adds test_bedrock_native_document_passthrough() in test_bedrock_native_passthrough.py.

^{This description was created by}^{for 8b613ad. You can customize this summary. It will automatically update as commits are pushed.}

ellipsis-dev

Important

Looks good to me! 👍

Reviewed everything up to 8b613ad in 2 minutes and 33 seconds. Click for details.

Reviewed 353 lines of code in 5 files
Skipped 0 files when reviewing.
Skipped posting 3 draft comments. View those below.
Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.

1. instructor/processing/multimodal.py:844

Draft comment:
Consider adding a timeout for the HTTP GET request (e.g. requests.get(self.source, timeout=10)) to prevent hanging on slow or unresponsive servers.
Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 30% vs. threshold = 85% The comment is suggesting a best practice (adding timeouts to HTTP requests) which is generally good advice. However, looking at the rules, I need to check: 1) Is this clearly a code change that's required? The code will work without it. 2) Is this consistent with the rest of the codebase? I see mixed usage - some methods have timeouts, others don't. 3) Is this speculative or definite? The comment says "to prevent hanging" which is a potential issue, not a definite one. The comment seems to be suggesting a code quality improvement rather than pointing out a clear bug. While it's good advice, the codebase is inconsistent about this pattern, and the comment is somewhat speculative about the hanging issue. The comment could be considered valid because it's an actionable code quality improvement and timeouts are a best practice. The fact that some other methods in the same file do use timeouts (like from_gs_url) suggests this might be an expected pattern. This isn't purely speculative - missing timeouts can cause real production issues. While timeouts are best practice, the codebase shows inconsistent usage of this pattern. Many other requests.get() calls in the file don't have timeouts (lines 227, 302, 385, 691, 750, 776), so this isn't clearly an established pattern that must be followed. The comment is suggesting a potential improvement rather than pointing out a clear bug or inconsistency with established patterns in this specific file. This comment should be deleted. While adding timeouts is good practice, the codebase doesn't consistently apply this pattern, and the comment is more of a speculative suggestion about potential issues rather than identifying a clear bug or required change. The rules state not to make speculative comments and to only comment when there's clearly a code change required.

2. instructor/processing/multimodal.py:812

Draft comment:
The filename sanitization regex removes periods (.) which eliminates file extensions. Confirm if this is intentional; preserving the extension might be useful.
Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 20% vs. threshold = 85% This is a code quality/potential bug comment about the sanitization logic. The comment identifies a specific issue: the regex removes periods, which eliminates file extensions. This seems like a legitimate concern since the name is often derived from actual filenames that would have extensions. The comment provides a concrete suggestion (adding "." to the regex pattern). However, the comment uses "Confirm if this is intentional" language, which violates the rule about not asking the PR author to confirm intentions. That said, the comment does provide a clear code suggestion that would fix the issue. The question is whether this is actually a problem - maybe Bedrock's API doesn't want or need file extensions in the name field? Without seeing Bedrock's documentation, it's hard to know if this is truly an issue. The comment asks the author to "confirm if this is intentional" which violates the rule against asking for confirmation. Additionally, without access to Bedrock's API documentation, I cannot verify whether file extensions should be preserved in the name field - this might be intentional sanitization per Bedrock's requirements. While the comment does ask for confirmation, it also provides a concrete, actionable code suggestion. The core observation (that periods are removed, eliminating extensions) is factually correct and the suggestion is clear. However, without knowing Bedrock's actual requirements, I cannot determine if this is truly a bug or intentional behavior. The comment should be deleted because it asks the author to confirm intentions ("Confirm if this is intentional"), which violates the review rules. Additionally, without evidence that preserving file extensions is required by Bedrock's API, this may be speculative.

3. instructor/processing/multimodal.py:857

Draft comment:
Consider wrapping the base64 decoding (pdf_bytes = base64.b64decode(self.data)) in a try/except block to provide a clearer error message if decoding fails.
Reason this comment was not posted:
Confidence changes required: 80% <= threshold 85% None

Workflow ID: wflow_58jfiFM3aBcM8iVn

^{You can customize}^{by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.}

jxnl · 2025-11-20T14:05:47Z

@cursoragent please udpate this PR to include documentation updates and review this

lucagobbi · 2025-11-25T07:57:05Z

@cursoragent please udpate this PR to include documentation updates and review this

Any news on this? I'm afraid the job wasnt triggered.

lucagobbi added 2 commits November 20, 2025 10:27

feat(bedrock): add document support

0ae7d71

feat(bedrock): add to_bedrock method for PDF handling

8b613ad

ellipsis-dev bot reviewed Nov 20, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Feature/bedrock document support #1936

Feature/bedrock document support #1936

Uh oh!

lucagobbi commented Nov 20, 2025 •

edited by ellipsis-dev bot

Loading

Uh oh!

ellipsis-dev bot left a comment

Uh oh!

jxnl commented Nov 20, 2025

Uh oh!

lucagobbi commented Nov 25, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Feature/bedrock document support #1936

Are you sure you want to change the base?

Feature/bedrock document support #1936

Uh oh!

Conversation

lucagobbi commented Nov 20, 2025 • edited by ellipsis-dev bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

1. PDF to Bedrock Conversion Method (instructor/processing/multimodal.py)

2. Bedrock Native Document Passthrough (instructor/providers/bedrock/utils.py)

Example Usage

Uh oh!

ellipsis-dev bot left a comment

Choose a reason for hiding this comment

Uh oh!

jxnl commented Nov 20, 2025

Uh oh!

lucagobbi commented Nov 25, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

lucagobbi commented Nov 20, 2025 •

edited by ellipsis-dev bot

Loading

1. PDF to Bedrock Conversion Method (`instructor/processing/multimodal.py`)

2. Bedrock Native Document Passthrough (`instructor/providers/bedrock/utils.py`)