About wikipedia-word-frequency-clean

This project provides word frequency lists generated from cleaned-up Wikipedia dumps for several languages. The following table shows the number of words (types) for each language and mutation (2nd–5th column) with links to the list files, as well as the number of tokens and articles (6th and 7th column).

| Language / Mutation | no norm. | no norm., lowercased | NFKC norm. | NFKC norm., lowercased | #tokens | #articles |
|---|---|---|---|---|---|---|
| Czech / regex | 866,635 | 772,788 | 866,619 | 772,771 | 137,564,164 | 832,967 |
| English / regex | 2,419,333 | 2,162,061 | 2,419,123 | 2,161,820 | 2,489,387,103 | 16,699,990 |
| English / Penn | 2,988,260 | 2,709,385 | 2,988,187 | 2,709,302 | 2,445,526,919 | 16,699,990 |
| French / regex | 1,187,843 | 1,061,089 | 1,187,646 | 1,060,849 | 842,907,281 | 4,108,861 |
| German / regex | 2,690,869 | 2,556,353 | 2,690,793 | 2,556,249 | 893,385,641 | 4,455,795 |
| Italian / regex | 960,238 | 852,087 | 960,149 | 851,996 | 522,839,613 | 2,783,290 |
| Japanese / Unidic Lite | 549,745 | 522,590 | 549,358 | 522,210 | 610,467,200 | 2,177,257 |
| Japanese / Unidic 3.1.0 | 561,212 | 535,726 | 560,821 | 535,341 | 609,365,356 | 2,177,257 |
| Portuguese / regex | 668,333 | 580,948 | 668,262 | 580,862 | 300,324,703 | 1,852,956 |
| Russian / regex | 2,069,646 | 1,854,875 | 2,069,575 | 1,854,793 | 535,032,557 | 4,483,522 |
| Spanish / regex | 1,124,168 | 987,078 | 1,124,055 | 986,947 | 685,158,870 | 3,637,655 |
| Chinese / jieba, experimental | 1,422,002 | 1,403,896 | 1,421,875 | 1,403,791 | 271,230,431 | 2,456,160 |
| Indonesian / regex | 433,387 | 373,475 | 433,376 | 373,461 | 117,956,650 | 1,314,543 |

The word lists for all the above languages are generated from dumps dated 20 October 2022, with the exception of Indonesian, which is generated from a dump dated 1 August 2024.

Furthermore, the project provides a script for generating the lists that can be applied to other Wikipedia languages.

About the wordlist files

For each word, the files (linked in the table above) list:

  • number of occurrences,
  • number of documents (Wikipedia articles).

Words occurring in fewer than 3 articles are not included. The lists are sorted by the number of occurrences. The data is tab-separated with a header, and the file is compressed with LZMA2 (xz).

Important: The last row labeled [TOTAL] lists the total numbers of tokens and articles, and thus may require special handling. Also, note that the totals are not sums of the previous rows' values.
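As a rough illustration of the special handling mentioned above, here is a minimal Python sketch that reads a list and sets the [TOTAL] row aside. It assumes three columns in the order word, occurrences, documents; that column order, the header names in the sample data, and the function names are assumptions made for illustration and are not taken from the project.

```python
import csv
import io
import lzma

TOTAL_LABEL = "[TOTAL]"  # label of the last row, as described above


def parse_wordlist(lines):
    """Parse tab-separated lines into (rows, total).

    Assumes three columns: word, occurrences, documents (assumed order).
    """
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    next(reader)  # skip the header row
    rows, total = [], None
    for word, occurrences, documents in reader:
        record = (word, int(occurrences), int(documents))
        if word == TOTAL_LABEL:
            total = record[1:]  # (total tokens, total articles)
        else:
            rows.append(record)
    return rows, total


def read_wordlist(path):
    """Read an xz-compressed word list file from disk."""
    with lzma.open(path, "rt", encoding="utf-8", newline="") as f:
        return parse_wordlist(f)


# Tiny made-up sample; the header names here are hypothetical.
sample = io.StringIO(
    "word\tcount\tdocuments\nthe\t10\t4\ncat\t3\t3\n[TOTAL]\t20\t5\n"
)
rows, total = parse_wordlist(sample)
total_tokens, total_articles = total
# Relative frequency per million tokens, using the total rather than a column sum.
per_million = {word: occ / total_tokens * 1_000_000 for word, occ, _ in rows}
```

Dividing by the [TOTAL] token count rather than by the sum of the listed rows matters because words below the article threshold are missing from the list.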

How is this different from wikipedia-word-frequency?

We strive for data that is cleaner (not containing spurious “words” such as br or colspan), and linguistically meaningful (correctly segmented, with consistent criteria for inclusion in the list). Here are the specific differences:

  1. Cleanup: We remove HTML/Wikitext tags (<br>, <ref>, etc.), table formatting (e.g. colspan, rowspan), some non-textual content (such as musical scores), placeholders for formulas and code (formula_…, codice_…), and ruby (furigana).

  2. Tokenization: We tokenize Japanese and Chinese, see About mutations. This is necessary because these languages do not separate words with spaces. (The wikipedia-word-frequency script simply extracts and counts any contiguous chunks of characters, which can range from a word to a whole sentence.)

    We tokenize other languages using a regular expression for orthographic words, consistently treating hyphen - and apostrophe ' as punctuation that cannot occur inside a word. (The wikipedia-word-frequency script allows these characters except at the start or end of a word, thus allowing women's but excluding mens'. It also blindly converts en-dashes to hyphens, e.g. tokenizing New York–based as New and York-based, and right single quotation marks to apostrophes, resulting in further discrepancies.)

    For English, in addition to the default regex tokenization, we also provide the Penn Treebank tokenization (e.g. can't segmented as ca and n't). In this case, apostrophes are allowed, and we also do a smart conversion of right single quotation marks to apostrophes (to distinguish the intended apostrophe in can’t from the actual quotation mark in ‘tuna can’).

  3. Normalization: For all languages, we provide mutations that are lowercased and/or normalized to NFKC.

Additionally, the script for generating the wordlists supports multiprocessing (processing several dump files of the same language in parallel), greatly reducing the wall-clock time necessary to process the dumps.
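Processing dump files independently works because per-file counts can simply be added together afterwards. The sketch below shows that idea in Python; it is not the project's code. The function names are hypothetical, the documents are assumed to be already tokenized, and counting every token toward the total is an assumption. Each call to count_documents could be handed to a separate worker process, which is not shown here.

```python
from collections import Counter


def count_documents(documents):
    """Count occurrences and article frequency for an iterable of token lists."""
    occurrences, doc_freq = Counter(), Counter()
    n_tokens = n_docs = 0
    for tokens in documents:
        occurrences.update(tokens)
        doc_freq.update(set(tokens))  # each word counts once per article
        n_tokens += len(tokens)
        n_docs += 1
    return occurrences, doc_freq, n_tokens, n_docs


def merge(partials):
    """Add up the partial results computed for separate dump files."""
    occurrences, doc_freq = Counter(), Counter()
    n_tokens = n_docs = 0
    for occ, df, tokens, docs in partials:
        occurrences.update(occ)
        doc_freq.update(df)
        n_tokens += tokens
        n_docs += docs
    return occurrences, doc_freq, n_tokens, n_docs


def to_rows(occurrences, doc_freq, n_tokens, n_docs, min_docs=3):
    """Sort by occurrences, drop words seen in fewer than min_docs articles."""
    rows = [
        (word, count, doc_freq[word])
        for word, count in occurrences.most_common()
        if doc_freq[word] >= min_docs
    ]
    rows.append(("[TOTAL]", n_tokens, n_docs))
    return rows


# Two made-up "dump files", each a list of tokenized articles.
file_a = [["a", "b", "a"], ["a", "c"]]
file_b = [["a", "b"], ["b", "a", "b"]]
rows = to_rows(*merge([count_documents(file_a), count_documents(file_b)]))
```

In this toy run the word c appears in only one article and is dropped, so the [TOTAL] row is larger than the sum of the rows above it.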

About mutations

For each language we provide several mutations.

  • All languages have the following mutations distinguished by the filename suffixes:

    • ….tsv.xz: no normalization,
    • …-lower.tsv.xz: no normalization, lowercased,
    • …-nfkc.tsv.xz: NFKC normalization,
    • …-nfkc-lower.tsv.xz: NFKC normalization, lowercased.

    In addition to that, there are two variants of English and Japanese tokenization:

  • English:

    • regex tokenization (same as for Czech, French, etc.)
    • filename contains -penn: improved Penn Treebank tokenization from nltk.tokenize.word_tokenize
  • Japanese: We do the same processing and provide the same mutations for Japanese as in TUBELEX-JA:

    • Unidic Lite tokenization
    • filename contains -310: Unidic 3.1.0 tokenization

Chinese is tokenized using the jieba tokenizer. See Further work and similar lists for caveats about experimental Chinese support.
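To make the effect of the four normalization mutations concrete, here is a small Python sketch using the standard unicodedata module. The mutate helper and the sample tokens are hypothetical, and applying NFKC before lowercasing is an assumption about ordering, not something this readme states.

```python
import unicodedata
from collections import Counter


def mutate(token, nfkc=False, lower=False):
    """Apply the optional NFKC normalization and lowercasing to one token."""
    if nfkc:
        token = unicodedata.normalize("NFKC", token)
    if lower:
        token = token.lower()
    return token


tokens = [
    "Wiki",
    "wiki",
    "\uff37\uff49\uff4b\uff49",  # "Wiki" written with fullwidth letters
    "\ufb01le",                  # "file" written with the "fi" ligature
    "file",
]

# Keys mirror the filename suffixes listed above.
variants = {
    "": Counter(mutate(t) for t in tokens),
    "-lower": Counter(mutate(t, lower=True) for t in tokens),
    "-nfkc": Counter(mutate(t, nfkc=True) for t in tokens),
    "-nfkc-lower": Counter(mutate(t, nfkc=True, lower=True) for t in tokens),
}
type_counts = {suffix: len(counts) for suffix, counts in variants.items()}
```

Each step merges more surface forms into one entry, which is why the number of types in the table shrinks from the unnormalized lists to the NFKC, lowercased ones.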

What is considered a word?

  1. English with Penn Treebank tokenization: Tokens that fulfil the following conditions:

    • do not contain digits
    • contain at least one word character (\w).

    E.g. a, o'clock, 'coz, pre-/post-, U.S.A., LGBTQ+, but not 42, R2D2, ... or ..

  2. Japanese and Chinese: Tokens that fulfil the following conditions:

    • do not contain digits (characters such as 一二三 are not considered digits),
    • start and end with a word character (\w), or wave dash (〜) in case of Japanese (e.g. あ〜).
  3. Other languages and English with the default regex tokenization:

    • tokens that consist of word characters (\w) except digits.
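The first two sets of conditions can be sketched as simple predicates in Python. This is an illustration under assumptions, not the project's implementation: the function names are hypothetical, "digit" is taken to mean what Python's \d matches, and the wave dash is taken to be U+301C.

```python
import re

DIGIT = re.compile(r"\d")      # Unicode decimal digits; does not match 一二三
WORD_CHAR = re.compile(r"\w")
WAVE_DASH = "\u301c"           # assumed code point for the wave dash


def is_word_penn(token):
    """No digits, and at least one word character anywhere in the token."""
    return not DIGIT.search(token) and WORD_CHAR.search(token) is not None


def is_word_cjk(token, japanese=False):
    """No digits, and the first and last characters are word characters
    (or, for Japanese only, a wave dash)."""
    if not token or DIGIT.search(token):
        return False

    def allowed(ch):
        return WORD_CHAR.fullmatch(ch) is not None or (japanese and ch == WAVE_DASH)

    return allowed(token[0]) and allowed(token[-1])


accepted = ["a", "o'clock", "'coz", "pre-/post-", "U.S.A.", "LGBTQ+"]
rejected = ["42", "R2D2", "...", "."]
penn_ok = all(is_word_penn(t) for t in accepted)
penn_rejects = not any(is_word_penn(t) for t in rejected)
```

The accepted and rejected lists reuse the examples given under the first condition set.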

The default regex tokenization considers all non-word characters (\W, i.e. not \w) word separators. Therefore, while in English with Penn Treebank tokenization some tokens (e.g. R2D2) are excluded, with the regex tokenization, tokens that would have to be excluded do not occur in the first place (e.g. R2D2 is tokenized as R and D).
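A pattern with this behavior can be sketched in a few lines of Python. The expression below is an assumption chosen to reproduce the examples in this readme (it treats digits as separators too, and it keeps underscores because \w includes them); the project's actual regular expression may differ.

```python
import re

# Runs of word characters that are not digits; everything else separates tokens.
TOKEN = re.compile(r"[^\W\d]+")


def tokenize(text):
    """Split text into orthographic words (hypothetical helper)."""
    return TOKEN.findall(text)


examples = {
    "R2D2": tokenize("R2D2"),
    "women's": tokenize("women's"),
    "New York\u2013based": tokenize("New York\u2013based"),
}
```

With this pattern no separate filtering step is needed, since a token containing a digit or punctuation can never be produced.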

Usage

  1. Install requirements:

    pip install -r requirements.txt

  2. Download and process dumps (default date and languages):

    zsh run.sh

    Alternatively, download and process dumps for a specific date and specific languages:

    zsh run.sh 20221020 cs sk

The run.sh script also outputs the table in this readme.

For usage of the Python script for processing the dumps, see python word_frequency.py --help.

Further work and similar lists

The word lists contain only the surface forms of the words (segments). For many purposes, lemmas, POS, and other information would be more useful. We plan to add further processing later.

Support for Chinese is only experimental. Chinese is currently processed “as is” without any conversion, which means that it's a mix of traditional and simplified characters (and also of different varieties of Chinese used on the Chinese Wikipedia). We also do not filter vocabulary/script variants (e.g. -{zh-cn:域;zh-tw:體}- or -{A|zh-hans:用户; zh-hant:使用者}-), which has the side effect of increasing the occurrences of tokens such as zh, hans, etc. The word list may still be fine for some NLP applications.

We are using wikiextractor to extract plain text from Wikipedia dumps. Ideally, almost no cleanup would be necessary after using this tool, but there is actually a substantial amount of non-textual content such as maps, musical scores, tables, math formulas and random formatting that wikiextractor doesn't remove or removes in a haphazard fashion (see the issue on GitHub). We try to remove both the legit placeholders and markup and also the most common markup that ought to be filtered by wikiextractor but isn't. The results are still imperfect, but rather than extending the removal in this tool, it would be better to fix wikiextractor. Another option would be to use the Wikipedia Cirrus search dumps instead (see this issue and my comment). Note that both approaches have been used to get pretraining data for large language models.

In the current version we have added Indonesian from a later dump. We observed the string "https" among relatively high frequency words, which means that our cleanup is less effective for the current Wikipedia dumps.

You may also like TUBELEX-JA, a large word list based on Japanese subtitles for YouTube videos, which is processed in a similar way.