Corpus Builder Wikipedia

This repository is set up to scrape Wikipedia articles, including their tables. The code uses the Python libraries Wikipedia and Beautiful Soup to scrape the contents of Wikipedia pages.

How to use:

-topic=<str> -level=<int> -folder=<str>

-topic: The topic to search for.

-level: The scraping depth, explained under "What is Level?" below.

-folder: The name of the folder where the scraped content is stored.
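As a rough illustration of how these three options could be read, here is a minimal sketch using Python's standard argparse module. It is not taken from make_corpus.py: the function name build_parser and the help strings are assumptions made for this example, and the actual script may parse its arguments differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the three options described above.

    Hypothetical helper: the real script may parse its arguments differently.
    """
    parser = argparse.ArgumentParser(description="Build a corpus from Wikipedia pages.")
    # Single-dash long options, written as -topic=<str> on the command line.
    parser.add_argument("-topic", type=str, required=True, help="topic to search for")
    parser.add_argument("-level", type=int, default=1, help="scraping depth")
    parser.add_argument("-folder", type=str, required=True, help="storage folder name")
    return parser


# Parse an explicit argument list instead of sys.argv so the example is repeatable.
args = build_parser().parse_args(
    ["-topic=India Politics", "-level=2", "-folder=IndiaPolitics"]
)
print(args.topic, args.level, args.folder)
```

The default of 1 for -level is also an assumption; the README does not say whether the option is required.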

What is Level?

The code searches for the topic, collects the related pages, and scrapes them. At level 1, it stops there. At level 2, it also collects the links found in those pages and scrapes the linked pages. In other words, the level sets how deep the scraper follows embedded links: each level above 1 adds one more round of link following.
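The level logic described above can be sketched as a breadth-first traversal. The sketch below is an illustration under stated assumptions, not the code in make_corpus.py: the function crawl, the get_links callback, and the toy link graph are all hypothetical, and no network access is involved. In a real run, the seed titles might come from wikipedia.search(topic) and the links from the links attribute of a page returned by wikipedia.page(title).

```python
from typing import Callable, Iterable, List


def crawl(seeds: Iterable[str], get_links: Callable[[str], Iterable[str]], level: int) -> List[str]:
    """Return page titles in the order they would be scraped.

    Hypothetical helper: `seeds` stands in for the search results and
    `get_links` for whatever returns the links embedded in a page.
    """
    scraped: List[str] = []
    seen = set()
    frontier = list(seeds)
    for _ in range(level):
        next_frontier: List[str] = []
        for title in frontier:
            if title in seen:
                continue  # skip pages that were already scraped
            seen.add(title)
            scraped.append(title)  # a real scraper would download the page here
            next_frontier.extend(get_links(title))
        frontier = next_frontier  # the next level follows the links just collected
    return scraped


# Toy link graph standing in for Wikipedia (assumed data, for illustration only).
LINKS = {"A": ["B", "C"], "B": ["D"], "C": ["A", "D"], "D": []}

level_1 = crawl(["A"], lambda title: LINKS[title], level=1)
level_2 = crawl(["A"], lambda title: LINKS[title], level=2)
print(level_1)
print(level_2)
```

Tracking already-seen titles is a design choice of this sketch; it keeps pages that link back to each other from being scraped twice.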

A sample scrape result is provided in Sample_Folder.

Find the corpus here
