Skip to content

mgmeyers/pdfannots2json

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

66 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

pdfannots2json

Extracts annotations from PDF and converts them to a JSON list.

Supported annotations:

  • highlight
  • strike
  • underline
  • text (also called notes)
  • rectangle
    • Note: rectangle annotations are exported as images

pdfannots2json uses UniPDF to extract annotations and MuPDF (Fitz) to extract images from PDFs.

Usage: pdfannots2json <input>

Arguments:
  <input>    Path to input PDF

Flags:
  -h, --help                          Show context-sensitive help.
  -v, --version                       Display the current version of pdf-annots2json
  -b, --ignore-before=TIME            Ignore annotations added before this date. Must be ISO 8601 formatted
  -w, --no-write                      Do not save images to disk
  -o, --image-output-path=STRING      Output path of image annotations
  -n, --image-base-name="annot"       Base name of saved images
  -f, --image-format="jpg"            Image format. Supports png and jpg
  -d, --image-dpi=120                 Image DPI
  -q, --image-quality=90              Image quality. Only applies to jpg images
  -e, --attempt-ocr                   Attempt to extract text from images. tesseract-ocr must be installed on your system
  -l, --ocr-lang="eng"                Set the OCR language. Supports multiple languages, eg. 'eng+deu'. The desired languages must be installed
      --tesseract-path="tesseract"    Absolute path to the tesseract executable
      --tess-data-dir=STRING          Absolute path to the tesseract data folder

Supported platforms (see releases)

  • Mac (intel, M1)
  • Linux (x64)
  • Windows (x64)

OCR

Using the --attempt-ocr flag instructs pdfannots2json to extract text from the images created by rectangle annotations. This requires that tesseract is installed on your system, including the appropriate language data (by default tesseract only support english). Tesseract can be installed from homebrew on mac, various linux package managers, and from here on windows: https://github.com/UB-Mannheim/tesseract/wiki Additional language files can be downloaded here: https://github.com/tesseract-ocr/tessdata (See here for a description of the language codes: https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html)

Sample output

[
  {
    "color": "#7fff7f",
    "colorCategory": "Green",
    "date": "2022-03-14T19:57:56Z",
    "id": "rectangle-p1x26y636",
    "imagePath": "/some/path/annot-1-x26-y636.jpg",
    "ocrText": "Fabio D’Antoni !-23-*®, Alessio Matiz 23, Franco Fabbro 2“ and Cristiano Crescentini 24© ",
    "page": 1,
    "type": "image",
    "x": 26.43,
    "y": 636.51
  },
  {
    "annotatedText": "Objectives: We explored the effects of a single 40-min session ",
    "color": "#ff7f7f",
    "colorCategory": "Red",
    "date": "2022-03-14T19:56:25Z",
    "id": "highlight-p1x205y514",
    "page": 1,
    "type": "highlight",
    "x": 205.42,
    "y": 514.86
  },
  {
    "annotatedText": "processing of distressing memories reported by a non-clinical sample of adult participants. Design: A within-subject design was used. Methods: Participants (n = 40 Psychologists/MDs) reported four distressing memories",
    "color": "#ffff7f",
    "colorCategory": "Yellow",
    "comment": "This is a highlight",
    "date": "2022-03-14T19:57:17Z",
    "id": "highlight-p1x166y462",
    "page": 1,
    "type": "highlight",
    "x": 166.02,
    "y": 462.99
  },
  {
    "annotatedText": "Post-Intervention, Follow-up. Results: SUD scores associated with EMDR, BSP, and BSM signifcantly decreased from Pre- to Post-Intervention (p \u003c 0.001). At Post-Intervention and Follow-up, EMDR and BSP SUD scores were signifcantly lower than BSM and BR scores (p \u003c 0.02). At both Post-Intervention a",
    "color": "#ffff7f",
    "colorCategory": "Yellow",
    "comment": "This is an underline",
    "date": "2022-03-14T19:57:11Z",
    "id": "underline-p1x166y385",
    "page": 1,
    "type": "underline",
    "x": 166.02,
    "y": 385.17
  },
  {
    "annotatedText": "Keywords: psychotherapy; distressing memories; EMDR; Brainspotting; body scan meditation; mindfulness; bottom-up therapy; body-oriented intervention; trauma; stress",
    "color": "#ff7fff",
    "colorCategory": "Magenta",
    "comment": "This is a strike",
    "date": "2022-03-14T19:57:07Z",
    "id": "strike-p1x166y295",
    "page": 1,
    "type": "strike",
    "x": 166.39,
    "y": 295.38
  },
  {
    "color": "#ff7fff",
    "colorCategory": "Magenta",
    "comment": "Hello",
    "date": "2022-03-14T19:57:00Z",
    "id": "text-p1x351y197",
    "page": 1,
    "type": "text",
    "x": 351.45,
    "y": 197.36
  },
  {
    "annotatedText": "In clinical contexts, however, distressing ",
    "color": "#7fff7f",
    "colorCategory": "Green",
    "date": "2022-03-14T19:57:23Z",
    "id": "highlight-p1x187y166",
    "page": 1,
    "type": "highlight",
    "x": 187.65,
    "y": 166.72
  },
  {
    "annotatedText": "and may originate from both “big trauma” (“T”), such as life-threatening experiences and sexual violence, o",
    "color": "#7fff7f",
    "colorCategory": "Green",
    "comment": "A green highlight",
    "date": "2022-03-14T19:57:32Z",
    "id": "highlight-p1x166y129",
    "page": 1,
    "type": "highlight",
    "x": 166.07,
    "y": 129.02
  }
]

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Packages

No packages published