-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathusage
48 lines (44 loc) · 2.92 KB
/
usage
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
scannedb-ok -- A tool for extracting text from a PDF that even works for scanned
books.
Usage: scannedb-ok COMMAND
scannedb-ok is a tool for extracting the text from a PDF and was written to
work for scanned books from books.google.com. It was designed to always
correctly extract the LINES of the text, to insert SPACES even in complicated
cases like emphasis with spaced letters and to repair SYLLABLE DIVISION at
line breaks. There are commands for extracting text, training an artificial
neural network, printing statistics etc. Type "scannedb-ok COMMAND -h" for
information about a command.
Available options:
-h,--help Show this help text
Available commands:
text scannedb-ok text extracts the text from a PDF. There
are options for stripping of page headers and footers
in order to make the pure text ready for text mining
and NLP. There are two input formats, pdf and xml.
words scannedb-ok words extracts a list of all non-divided
words from the document. The first an the last word
or word-part of a line is left from the list, since
they may be parts of word due to syllable division.
trainSpacing scannedb-ok trainSpacing trains an ANN for
inter-glyph spacing. This requires four input files:
TRAININGPDF and TRAININGTXT must contain exactly the
same text, once in PDF format (or some other format
conaining glyph descriptions, like pdfminers xml),
and once as plaintext with correct spaces. This pair
of files is used for training the neurons.
VALIDATIONPDF and VALIDATIONTXT have to be exactly
corresponding texts, too. They are used to validate
the trained network after each learning iteration and
the validation result is logged, so that the
progression can be observed. For a start, you can
simply reuse the first pair of files as validation
texts.
nospaces scannedb-ok nospaces extracts the text from a PDF
without inserting spaces, i.e. it produces scriptura
continua.
stats scannedb-ok stats shows statistics about the text.
spacing scannedb-ok spacing shows the inter-glyph distances
for statistical analysis. This generates CSV output.
glyphs scannedb-ok glyphs shows information about the glyphs
found in the document.
See also: https://github.com/lueck/scannedb-ok#readme