
royn5618/Corpus_Builder_Wikipedia

Corpus Builder Wikipedia

This repository is set up to scrape Wikipedia articles, including their tables. The code uses the Python libraries wikipedia and Beautiful Soup to scrape Wikipedia page contents.

How to use:

-topic=<str> -level=<int> -folder=<str>

-topic : The topic to search for on Wikipedia

-level : The number of scraping rounds, as an integer (explained below)

-folder : The name of the folder where the scraped results are stored
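The flag format above can be read with Python's standard argparse module. The following is a minimal sketch, not the repository's actual entry point: the function name parse_args and the choice to make all three flags required are assumptions made for illustration.

```python
import argparse


def parse_args(argv):
    """Parse -topic, -level and -folder from a list of argument strings.

    Hypothetical helper: the repository's real argument handling may differ.
    """
    parser = argparse.ArgumentParser(description="Build a corpus from Wikipedia")
    # argparse accepts single-dash long options written as -name=value.
    parser.add_argument("-topic", type=str, required=True,
                        help="topic to search for")
    parser.add_argument("-level", type=int, required=True,
                        help="number of scraping rounds")
    parser.add_argument("-folder", type=str, required=True,
                        help="storage folder name")
    return parser.parse_args(argv)


# Example invocation with an explicit argument list.
args = parse_args(["-topic=Alan Turing", "-level=2", "-folder=corpus"])
print(args.topic, args.level, args.folder)
```

Because -level is declared with type=int, a non-integer value is rejected by argparse before any scraping starts.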

What is Level?

The code searches for the topic, gets the related pages, and scrapes them. At level 1, it stops there. At level 2, it also collects the links embedded in those pages and scrapes the linked pages. In other words, level is the number of scraping rounds: the first round covers the search results, and each further round follows the links embedded in the pages scraped in the previous round.
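The level logic can be sketched as a depth-limited traversal. This is an illustration under stated assumptions, not the repository's code: collect_titles, toy_search and toy_links are hypothetical names, and a small in-memory link graph stands in for Wikipedia so the example runs offline. In a real run, the two callables could plausibly wrap the wikipedia library's search function and a page's links attribute.

```python
def collect_titles(topic, level, search, get_links):
    """Return page titles to scrape, in visiting order, for a given level.

    search(topic) yields the titles related to the topic;
    get_links(title) yields the titles linked from that page.
    Both are injected so the traversal can be tried without network access.
    """
    seen = set()
    ordered = []
    frontier = list(search(topic))      # round 1: the search results
    for _ in range(level):
        next_frontier = []
        for title in frontier:
            if title in seen:           # never scrape the same page twice
                continue
            seen.add(title)
            ordered.append(title)
            next_frontier.extend(get_links(title))
        frontier = next_frontier        # next round: links found in this one
    return ordered


# Toy data standing in for Wikipedia (hypothetical, for illustration only).
TOY_SEARCH = {"Python": ["Python (language)", "Monty Python"]}
TOY_LINKS = {
    "Python (language)": ["Guido van Rossum", "Monty Python"],
    "Monty Python": ["Comedy"],
}


def toy_search(topic):
    return TOY_SEARCH.get(topic, [])


def toy_links(title):
    return TOY_LINKS.get(title, [])


print(collect_titles("Python", 1, toy_search, toy_links))
print(collect_titles("Python", 2, toy_search, toy_links))
```

Tracking visited titles matters because Wikipedia pages link to each other heavily, so the number of candidate pages grows quickly with each extra level.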

A sample scrape result is provided in the Sample folder.

Find the corpus here
