When cell content exceeds cell boundaries, the next cell gets messed up (examples) #538

@shula

Description

When the content of 2 of the cells in the PDF continues beyond the cell's boundary, the next cell's content goes "crazy" (i.e. it is totally different from what is expected).

In the example sample, I assume the PDF source is Excel, where it's common to see long text cut off at the border of the cell. I don't know this for sure.

Command line used:
java -Dfile.encoding=UTF8 -jar tabula-1.0.5-jar-with-dependencies.jar sample.pdf -f TSV > sample.tsv

The bogus lines are the ones identified by (i.e. starting with) 1068 and 1103.
Output lines with the problem:
43 E2U9 A10L YCPCT "ש""א אקליפטוס סיטריאדורה SCITRIADORA/" 1068
60 43 10 CEUCC "ש""א אקליפטוס רדיאטה LYPTUSRADIATA/" 1103

In the output, I see 3 phenomena:

  1. The wrong text "A10L YCPCT" should have been "10 CC".
  2. The wrong text "E2U9" should have been "29", and so on.
  3. The word "EUCALIPTUS" is cut off in these lines. This makes sense, since the rest of it is not visible in the PDF, so it is not a real bug.
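One possible reading of phenomena 1 and 2, offered as an observation on the two sample values rather than a confirmed diagnosis: the expected characters still appear, in order, inside the garbled fields, with extra characters mixed in between them. The Python sketch below checks this for the values quoted above. The helper name `split_interleaved` is hypothetical and is not part of tabula.

```python
def split_interleaved(garbled, expected):
    """Greedily match `expected` as an in-order subsequence of `garbled`.

    Returns (matched_all, leftover), where `leftover` holds the characters
    of `garbled` that were not consumed by the match.
    """
    leftover = []
    i = 0  # index of the next character of `expected` still to be found
    for ch in garbled:
        if i < len(expected) and ch == expected[i]:
            i += 1
        else:
            leftover.append(ch)
    return i == len(expected), "".join(leftover)


# Garbled field and expected field, as quoted in this report.
samples = [("E2U9", "29"), ("A10L YCPCT", "10 CC")]

for garbled, expected in samples:
    ok, extra = split_interleaved(garbled, expected)
    print(f"{garbled!r}: contains {expected!r} in order = {ok}, extra = {extra!r}")
```

For both samples the expected text is found in order, and the leftover characters are "EU" and "ALYPT". That would be consistent with characters of the overflowing Latin text being merged into the neighbouring cells, but this is only a guess based on two lines.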

In the attached sample.pdf, the 3rd field of the converted text file should have been the text "10 CC".

My setup:

  • Windows 10
  • java version "1.8.0_401"
  • tabula 1.0.5
