docs/usage

scannedb-ok -- A tool for extracting text from a PDF that even works for scanned
books.

Usage: scannedb-ok COMMAND
  scannedb-ok is a tool for extracting the text from a PDF and was written to
  work for scanned books from books.google.com. It was designed to always
  correctly extract the LINES of the text, to insert SPACES even in complicated
  cases like emphasis with spaced letters and to repair SYLLABLE DIVISION at
  line breaks. There are commands for extracting text, training an artificial
  neural network, printing statistics etc. Type "scannedb-ok COMMAND -h" for
  information about a command.

Available options:
  -h,--help                Show this help text

Available commands:
  text                     scannedb-ok text extracts the text from a PDF. There
                           are options for stripping of page headers and footers
                           in order to make the pure text ready for text mining
                           and NLP. There are two input formats, pdf and xml.
  words                    scannedb-ok words extracts a list of all non-divided
                           words from the document. The first an the last word
                           or word-part of a line is left from the list, since
                           they may be parts of word due to syllable division.
  trainSpacing             scannedb-ok trainSpacing trains an ANN for
                           inter-glyph spacing. This requires four input files:
                           TRAININGPDF and TRAININGTXT must contain exactly the
                           same text, once in PDF format (or some other format
                           conaining glyph descriptions, like pdfminers xml),
                           and once as plaintext with correct spaces. This pair
                           of files is used for training the neurons.
                           VALIDATIONPDF and VALIDATIONTXT have to be exactly
                           corresponding texts, too. They are used to validate
                           the trained network after each learning iteration and
                           the validation result is logged, so that the
                           progression can be observed. For a start, you can
                           simply reuse the first pair of files as validation
                           texts.
  nospaces                 scannedb-ok nospaces extracts the text from a PDF
                           without inserting spaces, i.e. it produces scriptura
                           continua.
  stats                    scannedb-ok stats shows statistics about the text.
  spacing                  scannedb-ok spacing shows the inter-glyph distances
                           for statistical analysis. This generates CSV output.
  glyphs                   scannedb-ok glyphs shows information about the glyphs
                           found in the document.

See also: https://github.com/lueck/scannedb-ok#readme