Default pipeline generates many "unreadable" documents #433

@arnaudstiegler

Hi,
I use Augraphy extensively, but I've noticed that:

  • the default pipeline can be so destructive that a human can no longer read the text on the document (see the example below)
  • the only way to get a "milder" augmentation pipeline is to create a custom pipeline, which requires listing out every augmentation and is cumbersome to experiment with (there are so many options).

It'd be great to provide an option like "mild/strong" for the default pipeline, to give some control over it without needing to deep-dive into the internals of the package.
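To make the request concrete, here is a minimal sketch of what such a strength option might do. It is an illustration and not Augraphy's API: `STRENGTH_FACTORS`, `apply_strength`, and the `FakeAugmentation` stand-in are all hypothetical names, and the sketch assumes each augmentation object exposes its run probability as an attribute called `p`.

```python
from dataclasses import dataclass

# Hypothetical presets: multiplier applied to each augmentation's probability.
STRENGTH_FACTORS = {"mild": 0.3, "medium": 0.6, "strong": 1.0}


@dataclass
class FakeAugmentation:
    """Stand-in for an augmentation object; only the `p` attribute matters here."""
    name: str
    p: float


def apply_strength(augmentations, strength="mild"):
    """Scale each augmentation's probability `p` in place and return the list."""
    factor = STRENGTH_FACTORS[strength]
    for aug in augmentations:
        # Assumption: the run probability is stored on the object as `p`.
        aug.p = min(1.0, aug.p * factor)
    return augmentations


# Example: damp a (fake) phase so each effect fires less often.
phase = [FakeAugmentation("ink_bleed", 0.5), FakeAugmentation("dirty_rollers", 0.8)]
apply_strength(phase, "mild")
print([(a.name, round(a.p, 2)) for a in phase])
```

Scaling probabilities only makes each effect fire less often; it does not make an effect gentler when it does fire, so a real "mild" preset would probably also need narrower intensity ranges per augmentation.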

For instance, this doc is almost unreadable. Training models on unreadable docs can lead to really damaging behaviors, such as completely hallucinating answers on docs that they can't read.
[image: sample_105]
