Fandom.com provides wiki dumps at https://*.fandom.com/wiki/Special:Statistics, but most of them are outdated, and getting a fresh one requires contacting an admin.
This script produces an up-to-date wiki dump by scraping Fandom.com directly: it crawls Special:AllPages to collect the list of article titles, then requests a dump of those pages from Special:Export. Instructions for turning the dump into a corpus for natural language processing and training are provided below.
It works only for English Fandom sites; other languages require slight modifications.
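For reference, the export step works roughly like the sketch below (hypothetical variable names and titles; in the actual script the title list comes from scraping Special:AllPages):

import requests

fandom = "harrypotter"  # the Fandom subdomain to dump
titles = ["Harry Potter", "Hogwarts"]  # in practice, scraped from Special:AllPages

# Special:Export accepts a newline-separated list of page titles via POST;
# curonly=1 requests only the current revision of each page.
response = requests.post(
    f"https://{fandom}.fandom.com/wiki/Special:Export",
    data={"pages": "\n".join(titles), "curonly": "1"},
    timeout=60,
)
with open(f"{fandom}.xml", "wb") as f:
    f.write(response.content)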
The Chrome browser must be installed on the machine; the matching ChromeDriver is downloaded automatically by webdriver-manager.
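For illustration, this is roughly how webdriver-manager pairs Selenium with the right driver (a minimal sketch using the Selenium 4 API, not necessarily how this script wires it up):

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# webdriver-manager downloads and caches a ChromeDriver matching the
# installed Chrome; Selenium then drives Chrome through it.
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get("https://harrypotter.fandom.com/wiki/Special:AllPages")
print(driver.title)
driver.quit()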
The requirements.txt file lists all Python libraries the script depends on; install them with:
pip install -r requirements.txt
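If you need to reconstruct the file, a plausible minimal requirements.txt covering the scraping described above would be the following (the repo's own file is authoritative):

selenium
webdriver-manager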
- Clone the extractor locally (https://github.com/JOHW85/wikiextractor) with
git clone https://github.com/JOHW85/wikiextractor
- Open the terminal and cd your way to the repo dir:
cd wikiextractor
- Run
python3 setup.py install
- Finally, run
run-me.sh FANDOM1 FANDOM2
in the terminal, where FANDOM1 and FANDOM2 are Fandom subdomains (e.g. harrypotter for harrypotter.fandom.com), to get FANDOM1.jsonl and FANDOM2.jsonl in the current directory.
Example
run-me.sh harrypotter finalfantasy
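Each .jsonl file contains one JSON object per article. Here is a short sketch for loading a dump into a plain-text corpus, assuming wikiextractor's usual JSON fields such as "title" and "text":

import json

corpus = []
with open("harrypotter.jsonl", encoding="utf-8") as f:
    for line in f:
        article = json.loads(line)  # assumed fields: "id", "title", "text"
        corpus.append(article["text"])

print(f"Loaded {len(corpus)} articles")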