Use the correct way to pull Spans into a seq, allowing capture of multi-word names
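
For context, a minimal sketch of why the old lookup dropped the second word of a name. It does not call OpenNLP: the `{:start :end}` maps are a hypothetical stand-in for `opennlp.tools.util.Span`, assumed here to be a half-open range of token indices, and `spans->strings` only approximates what `Span/spansToStrings` is assumed to do (join the covered tokens with a space).

```clojure
(require '[clojure.string :as str])

;; Hypothetical stand-in data: tokens as a vector, one span covering
;; token indices 3 (inclusive) to 5 (exclusive).
(def tokens ["My" "name" "is" "James" "Brown"])
(def spans [{:start 3 :end 5}])

;; Old behaviour: look up only the token at each span's start index,
;; so every name is cut down to its first word.
(defn first-token-only [spans tokens]
  (map #(get tokens (:start %)) spans))

;; New behaviour (approximation): join every token the span covers.
(defn spans->strings [spans tokens]
  (map #(str/join " " (subvec tokens (:start %) (:end %))) spans))

(first-token-only spans tokens) ;; => ("James")
(spans->strings spans tokens)   ;; => ("James Brown")
```

Single-word names come out the same either way, which is why the existing "Lee" / "John" test did not catch the problem.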
dakrone committed Feb 12, 2012
1 parent 493ab08 commit 887add2
Showing 3 changed files with 8 additions and 3 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -5,3 +5,4 @@ docs/*
 parser-model/en-parser-chunking.bin
 .lein-failures
 multi-lib/*
+.lein-deps-sum
5 changes: 3 additions & 2 deletions src/opennlp/nlp.clj
@@ -11,7 +11,8 @@
 Detokenizer$DetokenizationOperation
 DictionaryDetokenizer
 TokenizerME
-TokenizerModel)))
+TokenizerModel)
+(opennlp.tools.util Span)))

 ;; OpenNLP property for pos-tagging. Meant to be rebound before
 ;; calling the tagging creators
@@ -108,7 +109,7 @@
 matches (.find finder (into-array String tokens))
 probs (seq (.probs finder))]
 (with-meta
-(distinct (map #(get tokens (.getStart %)) matches))
+(distinct (Span/spansToStrings matches (into-array String tokens)))
 {:probabilities probs}))))

 (defmulti make-detokenizer
5 changes: 4 additions & 1 deletion test/opennlp/test/nlp.clj
@@ -35,7 +35,10 @@
 (is (= (name-find (tokenize "My name is Lee, not John"))
 '("Lee" "John")))
 (is (= (name-find ["adsf"])
-'())))
+'()))
+(is (= (name-find (tokenize "My name is James Brown"))
+'("James Brown"))
+"should find names with two words"))

 (deftest detokenizer-test
 (is (= (detokenize (tokenize "I don't think he would've."))
