dict-dl is a tool from Dict2vec to download online lexical word definitions, and generate strong and weak pairs from these definitions. The dictionaries used to download definitions are :
To use scripts in this folder, you will need :
- python3
- numpy (python3 version)
To download online definitions, you first need a file containing a list of words (one per line). The script will download the definitions for each of these words. We provide an example with the file 1000-words.txt.
Then run download_definitions.txt
with the filename of the file containing the
list of words, like :
$ ./download_definitions.py 1000-words.txt
You can specify which POS (part of speech) to download, for example to only download the "noun" definitions of words, run :
$ ./download_definitions.py 1000-words.txt -pos NOUN
The different POS available are :
NOUN
VERB
ADJECTIVE
If no POS is given, all the definitions for each word will be downloaded.
This will write all the fetched definitions in the file
1000-words-definitions.txt (or 1000-words-definitions-noun.txt if you added the
optional flag -pos NOUN
). Each line will look like this :
XXX word def1 def2 def3 def4 def5 def6
^ ^ \___________________________/
1 2 3
1: code representing the dictionary the definition comes from.
Cam -> Cambridge
Dic -> dictionary.com
Oxf -> Oxford
Col -> Collins
2: the fetched word.
3: words from the definition. Stopwords have been removed. All words
are lowercased.
It is possible that you see :
ERROR: * timeout error.
* retry Collins - natural
during the execution of the script. This means the server was not able to respond to the HTTP request, so we can not have the definition for this word. The error message tells you the word and the dictionary that were faulty, so you can redownload the definition manually. This error happens rarely and is only dependent on the web server load.
Fetched definitions are not regrouped and can contain words that are not in the vocabulary. To clean the definitions (remove words we don't want and regroup the 4 definitions together for each word), run :
$ ./clean_definitions.py -d 1000-words-definitions.txt -v vocab.txt
-d, --definitions FILE File containing the fetched definitions from
previous script
-v, --vocab FILE Remove all words that are not in this
vocabulary file
This will produce a file named all-definitions-cleaned.txt where the first word of each line is the fetched word, and the rest of the line are the words from all its definitions.
To generate strong and weak pairs, you need a pre-trained word embeddings to compute the K closest neighbours (used to form additional strong pairs). Then run :
$ ./generate_pairs.py -d all-definitions-cleaned.txt -e vectors.vec -K 5
-d, --definitions FILE File containing word definitions
-e, --embedding FILE File containing the word embeddings used to
compute the K closest neighbours
-K NUMBER Number of artificially generated strong pairs
for each real strong pairs
-sf, --strong-file FILE Output filename for saving strong pairs
-wf, --weak-file FILE Output filename for saving weak pairs