Text with background missing from output. #225

Open
Quaddroo opened this issue Feb 27, 2025 · 0 comments
Reproduction PDF, free to download:
https://bmjopensem.bmj.com/content/bmjosem/1/1/e000050.full.pdf
When converting it to Markdown without any special options, I notice that most text with a colored background is missing from the output entirely. Table 2 is not affected, but all other tables suffer from this issue. The text is definitely present in the PDF.

I suspect this may be related to the following code being commented out, but I'm not sure:

pymupdf4llm/pymupdf4llm/helpers/multi_column.py

        # for i in range(len(new_rects) - 1, 0, -1):
        #     r = +new_rects[i]
        #     if in_bbox(r, path_rects):  # text with shaded background
        #         shadow_rects.insert(0, r)  # put in front to keep sequence
        #         del new_rects[i]
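
To make it clearer what that loop appears to do, here is a minimal, self-contained sketch of the same logic using plain tuples instead of PyMuPDF rectangles. The names `contains` and `split_shaded`, the tuple-based `Rect` type, and the sample coordinates are hypothetical and only for illustration; `in_bbox` here is a stand-in I wrote for the sketch, assumed to return a truthy value when a rectangle lies inside one of the given boxes, and may not match the library's helper exactly.

```python
from typing import List, Tuple

# Hypothetical stand-in for a PyMuPDF rectangle: (x0, y0, x1, y1)
Rect = Tuple[float, float, float, float]


def contains(outer: Rect, inner: Rect) -> bool:
    """True if `inner` lies fully inside `outer` (edges may touch)."""
    return (
        outer[0] <= inner[0]
        and outer[1] <= inner[1]
        and outer[2] >= inner[2]
        and outer[3] >= inner[3]
    )


def in_bbox(r: Rect, bboxes: List[Rect]) -> int:
    """Assumed behaviour: 1-based index of the first box containing r, else 0."""
    for i, bbox in enumerate(bboxes, start=1):
        if contains(bbox, r):
            return i
    return 0


def split_shaded(
    new_rects: List[Rect], path_rects: List[Rect]
) -> Tuple[List[Rect], List[Rect]]:
    """Mirror the commented-out loop: move rectangles that sit on a
    drawn (shaded) background from new_rects into shadow_rects."""
    new_rects = list(new_rects)  # work on a copy
    shadow_rects: List[Rect] = []
    # Walk backwards so deleting by index is safe.
    # As in the quoted code, the range stops before index 0,
    # so the first rectangle is never tested.
    for i in range(len(new_rects) - 1, 0, -1):
        r = new_rects[i]
        if in_bbox(r, path_rects):  # text with shaded background
            shadow_rects.insert(0, r)  # put in front to keep sequence
            del new_rects[i]
    return new_rects, shadow_rects


# Made-up text blocks and shaded backgrounds
text_rects = [(0, 0, 100, 20), (10, 30, 90, 50), (0, 60, 100, 80), (10, 90, 90, 110)]
backgrounds = [(0, 25, 100, 55), (0, 85, 100, 115)]

remaining, shaded = split_shaded(text_rects, backgrounds)
print(remaining)
print(shaded)
```

If this reading is right, the loop only sorts shaded text blocks into a separate list. Whether leaving it commented out is what causes the text to be dropped depends on how the surrounding code treats those rectangles, which I have not verified.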

I couldn't reliably tell whether any other open issues are related to this, but it didn't seem so. I'll try to debug it myself if no one comes to the rescue.
