Fixes README
joshuathayer committed May 6, 2015
1 parent c0e27da commit 80d7869
Showing 1 changed file with 12 additions and 8 deletions.
20 changes: 12 additions & 8 deletions README.md
@@ -152,10 +152,10 @@ var.
### Metadata and Text Extraction

`pantomime.extract` provides two functions for extracting metadata,
-content, and embedded files from from byte arrays, java.io.InputStream
-and java.net.URL instances as well as filenames as strings and
-java.io.File instances. The functions differ in how they handle
-embedded documents.
+content, and embedded files from byte arrays, java.io.InputStream and
+java.net.URL instances as well as filenames as strings and
+java.io.File instances. The extraction functions differ in how they
+handle embedded documents.

`pantomime.extract/parse` takes as its single argument any of the
types mentioned above. It returns a map containing all the metadata
@@ -185,16 +185,18 @@ An example:
`pantomime.extract/parse-extract-embedded` also returns Tika-extracted
metadata and document text, but it handles embedded documents
differently. Instead of returning the concatenation of all embedded
-document text, it saves each embedded file to the filesystem, and
-includes a vector of file names and paths in the returned data.
+document text, it saves each embedded file to the filesystem and
+includes a vector of file names and paths in the returned
+data. Remember to remove those files when you're done with them!

-For example, the file `fileAttachment.pdf` contains a single attached file, which gets saved to `/tmp/pantomime1430952739353-590574117`:
+For example, the file `fileAttachment.pdf` contains a single attached
+file, which gets saved to `/tmp/pantomime1430952739353-590574117`:

``` clojure
(require [clojure.java.io :as io]
[pantomime.extract :as extract])

-(pprint (extract/parse "test/resources/pdf/fileAttachment.pdf"))
+(pprint (extract/parse-extract-embedded "test/resources/pdf/fileAttachment.pdf"))

;= {:date ("2012-11-23T14:40:50Z"),
;= :producer ("Acrobat Distiller 9.5.2 (Windows)"),
@@ -208,6 +210,8 @@ For example, the file `fileAttachment.pdf` contains a single attached file, whic
;= ...}
```

+Note that `parse-extract-embedded` saves files to the temp dir returned by `java.io.tmpdir` (via https://github.com/Raynes/fs/).
+
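The README change above reminds callers to remove the extracted files once they are done with them. One possible way to do that, using only `clojure.java.io`, is sketched below. `delete-recursively!` is a hypothetical helper written for illustration; it is not part of pantomime, and it assumes the caller already knows the directory that holds the extracted files.

``` clojure
(require '[clojure.java.io :as io])

(defn delete-recursively!
  "Deletes `root` and everything beneath it.
  Hypothetical helper for illustration; not part of pantomime."
  [root]
  ;; file-seq lists a directory before its contents, so reversing the
  ;; sequence deletes children first and leaves each directory empty
  ;; by the time it is deleted itself.
  (doseq [f (reverse (file-seq (io/file root)))]
    ;; `true` means: do not throw if the file cannot be deleted
    ;; (for example because it is already gone).
    (io/delete-file f true)))
```

With the directory shown in the README example, a call might look like `(delete-recursively! "/tmp/pantomime1430952739353-590574117")`. The base temp directory on a given machine can be checked with `(System/getProperty "java.io.tmpdir")`.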
If extraction fails, the functions will return the following:

``` clojure