Images #437

Closed
Scoutink opened this issue May 6, 2024 · 6 comments

@Scoutink

Scoutink commented May 6, 2024

Great work!
Certainly the best I have come across, and I've been searching...
Could you make it also extract images from PDFs? I gave it a complex PDF with text, tables, and images, and the result was impressive apart from the missing images.

@NastyBoget
Collaborator

Thank you for your feedback!
Extracting images from PDFs should work with the parameter with_attachments="true".
You can get more details in the documentation:

  1. If you use the dedoc API, the parameter description and the return format description may help.
  2. If you use the dedoc library, this page can be useful.

Feel free to ask any questions if something doesn't work properly.
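
For readers following along, here is a minimal sketch of how such a request might look. It is not taken from the dedoc documentation: the URL `http://localhost:1231/upload` assumes a locally running dedoc instance and should be checked against your own setup, and the helper names `build_parameters` and `upload` are hypothetical. Only the parameter names (`with_attachments`, `return_format`, `return_base64`) come from this thread.

```python
def build_parameters(return_base64: bool = False) -> dict:
    """Build the form parameters mentioned in this thread.

    Values are passed as strings, as in with_attachments="true".
    """
    parameters = {"with_attachments": "true", "return_format": "json"}
    if return_base64:
        parameters["return_base64"] = "true"
    return parameters


def upload(file_path: str,
           url: str = "http://localhost:1231/upload",  # assumed address, adjust to your instance
           return_base64: bool = False) -> dict:
    """Send a file to a dedoc instance and return the parsed JSON response."""
    import requests  # third-party; imported here so the helper above works without it

    with open(file_path, "rb") as file:
        response = requests.post(
            url,
            files={"file": file},
            data=build_parameters(return_base64),
            timeout=120,
        )
    response.raise_for_status()
    return response.json()
```

Calling `upload("sample.pdf")` would then return the parsed document as a dictionary, provided the service is reachable at the assumed URL.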

NastyBoget self-assigned this May 7, 2024
@Scoutink
Author

Scoutink commented May 7, 2024

@NastyBoget Thank you for the feedback. For some reason it is not working for me. I'm not sure if this is appropriate to ask here, but would you be open to helping me use this in a project I am working on (for a fee we can discuss, of course)? I may even point you to another repo with an interesting approach to this same field that may improve the results.

@NastyBoget
Collaborator

If you have an example PDF file, I can try to process it myself. Maybe we have a bug in some specific cases.

@Scoutink
Author

Scoutink commented May 7, 2024

[sample.pdf]
Here is a sample I downloaded from the web. I am testing here, btw: https://dedoc-readme.hf.space/

@NastyBoget
Collaborator

NastyBoget commented May 7, 2024

Did you try setting return_format to json? In that case, images are listed in the attachments field (I tried it myself). Currently, we don't show attachments in the HTML representation (this will be added in the next version of dedoc). We use the HTML representation to make dedoc's parsing results easier to inspect, but in practice (in our projects) we use the json representation.

By the way, I would recommend working with attached images using the dedoc library (not the API), or using the return_base64 parameter to store image content in metadata.
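
As a rough illustration of the json route, the sketch below walks the attachments field of a parsed response and decodes any base64 content found in each attachment's metadata. The response shape is an assumption modeled on what is described above (an `attachments` list whose items carry `metadata`). The metadata key names `file_name` and `base64_encode` are guesses used for illustration and should be verified against the return format description. `iter_attachments` and `decode_images` are hypothetical helper names.

```python
import base64


def iter_attachments(document: dict):
    """Yield every attachment dictionary, including nested ones."""
    for attachment in document.get("attachments") or []:
        yield attachment
        yield from iter_attachments(attachment)


def decode_images(document: dict, base64_key: str = "base64_encode") -> dict:
    """Return {file name: raw bytes} for attachments that carry base64 content.

    base64_key is an assumed metadata field name; pass the real one if it differs.
    """
    images = {}
    for attachment in iter_attachments(document):
        metadata = attachment.get("metadata") or {}
        encoded = metadata.get(base64_key)
        if encoded:
            name = metadata.get("file_name", "unnamed")
            images[name] = base64.b64decode(encoded)
    return images
```

The returned bytes can then be written to disk with ordinary file operations, for example `open(name, "wb").write(data)` for each item.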

@Scoutink
Author

Scoutink commented May 7, 2024

That makes sense, and it is very helpful. Thanks.

2 participants