A standalone Java library and command line tool that converts DOC, DOCX, PPT, PPTX and ODT documents to PDF files. (Requires JRE 6)
Why?
I wanted a simple program that can convert Microsoft Office documents to PDF without dependencies like LibreOffice or expensive proprietary solutions. Since the code and libraries for converting each individual format are scattered around the web, I decided to combine those solutions into a single program. Along the way, I added ODT support as well, since I came across the code for it too.
Command Line Usage:
java -jar doc-converter.jar -type "type" -input "path" -output "path" -verbose
java -jar doc-converter.jar -input test.doc
java -jar doc-converter.jar -i test.ppt -o ~\output.pdf
java -jar doc-converter.jar -i ~\no-extension-file -o ~\output.pdf -t docx
Parameters:
-inputPath (-i, -in, -input) "path" : specifies a path for the input file
-outputPath (-o, -out, -output) "path" : Specifies a path for the output PDF. Defaults to the input file's directory and name with a .pdf extension if not specified (Optional)
-type (-t) [DOC | DOCX | PPT | PPTX | ODT] : Specifies which converter to use. Omit to let the program infer it from the file extension (Optional)
-verbose (-v) : Shows intermediate processing messages. (Optional)
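The two defaults described above (type inferred from the file extension, output path derived from the input path) can be sketched as follows. This is only an illustration of the documented behaviour, not the tool's actual code: the class `ConverterArgsSketch`, its methods, and the choice to throw when the extension is missing or unsupported are all assumptions made for this example.

```java
import java.io.File;
import java.util.Locale;

/** Hypothetical helper, not part of doc-converter; it mirrors the documented defaults. */
final class ConverterArgsSketch {

    /** The document types listed for the -type parameter. */
    enum DocType { DOC, DOCX, PPT, PPTX, ODT }

    /** Infers the document type from the file extension (assumed to be case-insensitive). */
    static DocType inferType(String inputPath) {
        String name = new File(inputPath).getName();
        int dot = name.lastIndexOf('.');
        if (dot < 0 || dot == name.length() - 1) {
            // Assumption: without an extension the caller has to pass -type explicitly.
            throw new IllegalArgumentException("No file extension: " + inputPath);
        }
        String ext = name.substring(dot + 1).toUpperCase(Locale.ROOT);
        try {
            return DocType.valueOf(ext);
        } catch (IllegalArgumentException e) {
            throw new IllegalArgumentException("Unsupported extension: " + ext, e);
        }
    }

    /** Builds the default output path: same directory and base name, with a .pdf extension. */
    static String defaultOutputPath(String inputPath) {
        File input = new File(inputPath);
        String name = input.getName();
        int dot = name.lastIndexOf('.');
        String base = dot > 0 ? name.substring(0, dot) : name;
        // A null parent is allowed here and yields a path relative to the working directory.
        return new File(input.getParentFile(), base + ".pdf").getPath();
    }

    public static void main(String[] args) {
        System.out.println(inferType("test.doc"));
        System.out.println(defaultOutputPath("test.doc"));
    }
}
```

Under these assumptions, an input without an extension (as in the last command line example) is rejected by `inferType`, which is why that example passes `-t docx` explicitly.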
Library Usage:
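1. Drop the jar into your lib folder and add it to your build path.
2. Choose a converter. They are named DocToPDFConverter, DocxToPDFConverter, PptToPDFConverter, PptxToPDFConverter and OdtToPDFConverter.
3. Instantiate it with 4 parameters:
   3a: InputStream inStream: Document source stream to be converted
   3b: OutputStream outStream: Document output stream
   3c: boolean showMessages: Whether to show intermediate processing messages on standard output (stdout)
   3d: boolean closeStreamsWhenComplete: Whether to close the input and output streams when complete
4. Call the "convert()" method and wait.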
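The calling pattern for the library steps can be sketched roughly as below. Because the package name and the exceptions declared by the real converters are not stated here, the sketch uses a hypothetical stand-in class, `StubDocxToPDFConverter`, that only copies the documented constructor shape and the `convert()` method. It copies bytes instead of producing a PDF, so the example compiles without the jar. In real code you would replace it with `DocxToPDFConverter` (or one of the other converters) from the library.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

/** Hypothetical stand-in for a real converter; it copies bytes and does NOT create a PDF. */
class StubDocxToPDFConverter {
    private final InputStream inStream;
    private final OutputStream outStream;
    private final boolean showMessages;
    private final boolean closeStreamsWhenComplete;

    // Same four parameters, in the same order, as described in this README.
    StubDocxToPDFConverter(InputStream inStream, OutputStream outStream,
                           boolean showMessages, boolean closeStreamsWhenComplete) {
        this.inStream = inStream;
        this.outStream = outStream;
        this.showMessages = showMessages;
        this.closeStreamsWhenComplete = closeStreamsWhenComplete;
    }

    /** Assumption: the real convert() may declare different exceptions. */
    void convert() throws IOException {
        if (showMessages) {
            System.out.println("Converting (stub)...");
        }
        byte[] buffer = new byte[4096];
        int read;
        while ((read = inStream.read(buffer)) != -1) {
            outStream.write(buffer, 0, read);
        }
        outStream.flush();
        if (closeStreamsWhenComplete) {
            inStream.close();
            outStream.close();
        }
    }
}

final class LibraryUsageSketch {

    /** Instantiates the converter with the 4 parameters, then calls convert(). */
    static byte[] convertInMemory(byte[] source) {
        InputStream in = new ByteArrayInputStream(source);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            // showMessages = false, closeStreamsWhenComplete = true
            new StubDocxToPDFConverter(in, out, false, true).convert();
        } catch (IOException e) {
            throw new IllegalStateException("In-memory streams should not fail", e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] result = convertInMemory("placeholder".getBytes("UTF-8"));
        System.out.println("Stub wrote " + result.length + " bytes");
    }
}
```

For files on disk, the same pattern would use a `FileInputStream` and `FileOutputStream` in place of the in-memory streams. Passing `true` for `closeStreamsWhenComplete` leaves closing them to the converter.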
Caveats and technical details:
This tool relies on the Apache POI, xdocreport, docx4j and odfdom libraries. They are not 100% reliable and the output formatting may not always be what you want.
DOC:
Generally OK, but takes some time to convert. I have noticed that paragraph spacing tends to increase after conversion, which affects the page layout. Conversion is done using docx4j to convert DOC to DOCX and then to PDF.
(xdocreport cannot be used once the DOCX data is obtained, as the intermediate data structure is docx4j-specific.)
DOCX:
Very good results and fast conversion. Conversion is done using the xdocreport library, as it seems faster and more accurate than docx4j.
PPT and PPTX:
The resulting file is a PDF with a PNG image embedded in each page. This should be good enough for printing. It is a limitation of the Apache POI and docx4j libraries.
ODT:
Quality and speed are as good as for DOCX. Conversion is done using odfdom from the Apache ODF Toolkit.
Main Libraries
Apache POI: https://poi.apache.org/
xdocreport: http://code.google.com/p/xdocreport/
docx4j: http://www.docx4java.org/
odfdom: https://incubator.apache.org/odftoolkit/odfdom/
and others...
The MIT License (MIT)
Copyright (c) 2013-2014 Yeo Kheng Meng