Add LowercaseTextProcessor for text normalization #9

@ArchishmanSengupta

Description of the feature request:

It would be helpful to have a simple processor in genai_processors/contrib that converts all text in incoming ProcessorParts to lowercase. This would make it easier for users to build normalization pipelines.

Proposed API:
Location: genai_processors/contrib/lowercase_text_processor.py
Class: LowercaseTextProcessor
Inherits from: PartProcessor
Logic: If the part is text (is_text(part.mimetype)), yield a copy with its text converted to lowercase; otherwise, yield the part unchanged. All metadata is preserved in both cases.
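
To make the proposed logic concrete, here is a minimal, self-contained sketch. It does not import genai_processors: the Part dataclass and the is_text helper below are hypothetical stand-ins for ProcessorPart and the library's mimetype check, and the method name call is an assumption. A real implementation would inherit from PartProcessor and follow that class's actual interface.

```python
import asyncio
import dataclasses
from typing import Any, AsyncIterator


@dataclasses.dataclass(frozen=True)
class Part:
    """Hypothetical stand-in for ProcessorPart (not the library's class)."""
    text: str = ""
    mimetype: str = "text/plain"
    metadata: dict[str, Any] = dataclasses.field(default_factory=dict)


def is_text(mimetype: str) -> bool:
    """Stand-in for the is_text check named in the proposal."""
    return mimetype.startswith("text/")


class LowercaseTextProcessor:
    """Sketch only; the proposed class would inherit from PartProcessor."""

    async def call(self, part: Part) -> AsyncIterator[Part]:
        if is_text(part.mimetype):
            # Copy the part so mimetype and metadata carry over untouched.
            yield dataclasses.replace(part, text=part.text.lower())
        else:
            # Non-text parts pass through as the same object.
            yield part


async def run(parts: list[Part]) -> list[Part]:
    """Feed each part through the processor and collect the outputs."""
    processor = LowercaseTextProcessor()
    outputs = []
    for part in parts:
        async for out in processor.call(part):
            outputs.append(out)
    return outputs


if __name__ == "__main__":
    sample = [
        Part(text="Hello WORLD", metadata={"turn": 1}),
        Part(mimetype="image/png", metadata={"id": "img"}),
    ]
    for out in asyncio.run(run(sample)):
        print(out)
```

Yielding a copy rather than mutating the incoming part keeps the original untouched for any other consumer of the same stream.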

What problem are you trying to solve with this feature?

  • A tokenizer may split "Hello", "hello", and "HELLO" into different numbers of tokens. Lowercasing first ensures that all three are treated the same.

  • Improved search and matching: once text is lowercased, case-insensitive lookups reduce to plain string comparison.
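
As a small illustration of the two points above, the following sketch uses only Python's built-in string methods. The helper name normalize is hypothetical and not part of the proposal.

```python
def normalize(text: str) -> str:
    """Lowercase text, as the proposed processor would for text parts."""
    return text.lower()


variants = ["Hello", "hello", "HELLO"]

# All three spellings collapse to a single normalized form.
unique_forms = {normalize(v) for v in variants}

# Matching becomes a plain equality or substring test after normalization.
query = "HELLO"
document = "Well, Hello there"
found = normalize(query) in normalize(document)

if __name__ == "__main__":
    print(unique_forms)
    print(found)
```

Note that str.lower() is not a full Unicode case fold: "Straße".lower() and "STRASSE".lower() still differ, whereas str.casefold() maps both to the same string. Whether the processor should offer that option is a design choice left open here.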

Any other information you'd like to share?

No response
