mat02/alf-21970-repo

 
 


Alfresco Patch to High CPU consumption when parsing PDF files

Described in issue ALF-21970.

PDF content extractors causing high CPU/memory with some PDFs.

Certain PDFs can cause an out-of-memory (OOM) error on the system when content extraction is run on them. Heap usage is easily exhausted when several of these files are uploaded at once and content extraction is attempted on them. Uploading 10 of these files at once will likely cause a denial of service (DoS).

When a system is affected by this issue, a thread dump like the following will be observed:

java.lang.Thread.State: RUNNABLE
 at java.lang.Object.hashCode(Native Method)
 at java.util.HashMap.hash(HashMap.java:338)
 at java.util.HashMap.put(HashMap.java:611)
 at org.apache.pdfbox.pdmodel.PDResources.reverseMap(PDResources.java:658)
 at org.apache.pdfbox.pdmodel.PDResources.setXObjects(PDResources.java:332)
 at org.apache.pdfbox.pdmodel.PDResources.getXObjects(PDResources.java:269)
 at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:286)
 at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:288)
 at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:288)
 at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:288)
 at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:288)
 at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:288)
 at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:288)
 at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:288)
 at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:288)
 at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:288)
 at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:288)
 at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:288)
 at org.apache.tika.parser.pdf.PDF2XHTML.endPage(PDF2XHTML.java:220)
 at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:460)
 at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:383)
 at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:342)
 at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117)
 at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:150)
 at org.alfresco.repo.content.transform.TikaPoweredContentTransformer.transformInternal(TikaPoweredContentTransformer.java:244)
 at org.alfresco.repo.content.transform.AbstractContentTransformer2.transform(AbstractContentTransformer2.java:250)
 at org.alfresco.repo.content.transform.AbstractContentTransformer2.transform(AbstractContentTransformer2.java:202)
 at org.alfresco.repo.web.scripts.solr.NodeContentGet.execute(NodeContentGet.java:206)

A sample PDF that reproduces this scenario is provided in Tika-Issue-Full-CPU.PDF.
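The repeated PDF2XHTML.extractImages frames in the thread dump show the extractor recursing through nested resources. The sketch below is not the code of this patch. It is a generic illustration, under the assumption that the same resource objects are reached repeatedly or cyclically, of how a recursive walk can be bounded with an identity-based "visited" set. The Resource class and the countDistinct method are hypothetical stand-ins and are not part of PDFBox, Tika, or Alfresco.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

public class VisitedGuardSketch {

    /** Hypothetical stand-in for a resource dictionary that may reference itself. */
    static final class Resource {
        final String name;
        final List<Resource> children = new ArrayList<>();

        Resource(String name) {
            this.name = name;
        }
    }

    /** Walks the resource graph, processing each distinct object at most once. */
    static int countDistinct(Resource root) {
        // Identity-based set: two resources are "the same" only if they are the same object.
        Set<Resource> seen =
                Collections.newSetFromMap(new IdentityHashMap<Resource, Boolean>());
        return walk(root, seen);
    }

    private static int walk(Resource node, Set<Resource> seen) {
        if (!seen.add(node)) {
            return 0; // already processed: stop instead of recursing again
        }
        int count = 1;
        for (Resource child : node.children) {
            count += walk(child, seen);
        }
        return count;
    }

    public static void main(String[] args) {
        // Build a small cyclic graph: page -> form -> page, and form -> form.
        Resource page = new Resource("page");
        Resource form = new Resource("form");
        page.children.add(form);
        form.children.add(page);
        form.children.add(form);

        System.out.println(countDistinct(page));
    }
}
```

Without the visited check, the walk over this small cyclic graph would never terminate. With it, the work is bounded by the number of distinct objects.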

License

The patch is licensed under the LGPL v3.0.

State

Current patch release is 1.0.0.

Compatibility

The current version has been developed using Alfresco 201605 and Maven. It's also relevant for Alfresco 201707.

Downloading the ready-to-deploy JAR

The binary distribution is made of one JAR file to be deployed in Alfresco as an endorsed lib:

You can install it by copying the JAR file to $ALFRESCO_HOME/tomcat/endorsed and restarting Alfresco.

Building the artifacts

You can build the artifacts from source code using Maven:

$ mvn clean package
