harvardGazette-webscraper

A university-website scraper and parallel-programming project that gathers all news links from the Harvard Gazette news site in a synchronised, threadsafe manner.

A presentation (about a 5-minute read) can be found in the repository.

It uses BeautifulSoup4 and regular expressions to gather links, MongoDB to store them, and argparse to handle command-line arguments.

1. Takes an endpoint, e.g. "https://news.harvard.edu/gazette/story/2022/"

- Uses only 1 thread, as this is only 1 request, so a single thread is already peak performance.
- Gathers the years of published articles (see the first sketch after this list).

2. Starts the specified number of link threads (the default is the number of publication years)

- These threads gather the number of pages that contain links to news articles.
- A semaphore synchronises the next thread to start on task #3 (see the semaphore sketch after this list).

3. Starts the link-gathering threads (the default number of threads is the same as in task #2)

- Waits for the semaphore signal to start.
- Iterates through the pages and gathers links.
- When a page is finished, signals with semaphores for the task #4 threads to start.
- Optimised thread mode: starts a thread for each page, minimising the time spent on IO-bound work (also covered in the semaphore sketch after this list).

4. Starts the link-uploading threads to the MongoDB server

- Waits for the semaphore to start.
- Starts uploading.
- Locks the shared variable to keep it threadsafe (see the upload sketch after this list).
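
The steps above are easiest to see in code. A minimal sketch of step 1, assuming the year archives are linked as /gazette/story/YYYY/ URLs (the exact HTML structure is an assumption; the real parsing lives in main.py and download.py):

```python
import re
import requests
from bs4 import BeautifulSoup

YEAR_RE = re.compile(r"/gazette/story/(\d{4})/")

def gather_years(endpoint):
    """Step 1: one request on one thread; collect the publication years."""
    html = requests.get(endpoint, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    # Assumed pattern: each year archive is linked as .../gazette/story/YYYY/
    return sorted({YEAR_RE.search(a["href"]).group(1)
                   for a in soup.find_all("a", href=YEAR_RE)})
```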
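
Steps 2 and 3 hinge on a semaphore handoff and, in optimised mode, one thread per page. A sketch of that pattern, where count_pages and extract_links are hypothetical stand-ins for the repository's parsing code:

```python
import queue
import threading

pages_ready = threading.Semaphore(0)   # released once per counted year (step 2)
upload_ready = threading.Semaphore(0)  # released once per scraped page (step 3)
link_queue = queue.Queue()             # threadsafe hand-off to the uploaders
page_counts = {}
counts_lock = threading.Lock()

def year_worker(year):
    """Step 2: find how many listing pages a year has, then signal step 3."""
    n = count_pages(year)                      # hypothetical pagination helper
    with counts_lock:
        page_counts[year] = n
    pages_ready.release()

def scrape_page(year, page_no):
    """One listing page: collect its article links, then signal step 4."""
    for link in extract_links(year, page_no):  # hypothetical parsing helper
        link_queue.put(link)
    upload_ready.release()

def link_worker():
    """Step 3, optimised mode: wait for a counted year, then fan out per page."""
    pages_ready.acquire()                      # blocks until step 2 signals
    with counts_lock:
        year, n = page_counts.popitem()        # take any finished year
    threads = [threading.Thread(target=scrape_page, args=(year, p))
               for p in range(1, n + 1)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```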
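
Step 4 then reduces to a consumer loop. A sketch against a local MongoDB instance, reusing link_queue and upload_ready from the sketch above (the connection string and the gazette.links collection name are assumptions):

```python
import threading
from pymongo import MongoClient

collection = MongoClient("mongodb://localhost:27017")["gazette"]["links"]
uploaded = []                     # shared record of what has been uploaded
uploaded_lock = threading.Lock()  # the lock that keeps the shared list threadsafe

def upload_worker():
    """Step 4: wait for a finished page, then push its links to MongoDB."""
    while True:
        upload_ready.acquire()                 # blocks until step 3 signals
        batch = []
        while not link_queue.empty():          # drain the shared queue
            batch.append(link_queue.get())
        if batch:
            collection.insert_many([{"url": u} for u in batch])
            with uploaded_lock:                # guard the shared variable
                uploaded.extend(batch)
```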

Time with the optimised threading algorithm:

Average: 87.47 seconds
Median: 87.06 seconds

Time without multithreading:

Average: 1448.12 seconds
Median: 1449.76 seconds

Requirements

Running the program requires the following modules:

  • argparse, pip3 install argparse  # parsing command-line arguments
  • pymongo, pip3 install pymongo  # connecting to the database
  • re (regular expressions)
  • bs4 (BeautifulSoup), pip3 install beautifulsoup4  # parsing HTML
  • queue
  • requests
  • date
  • time
  • threading
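
Of these, only pymongo, beautifulsoup4, and requests are third-party packages on Python 3; the remaining modules ship with the standard library (argparse's pip line above is optional on modern Python). A single command covers the install:

```
pip3 install pymongo beautifulsoup4 requests
```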

Usage

main.py was made into an executable (main) with PyInstaller. It is located inside the main zip folder or the dist folder, and has been tested to work properly with the flags as well.

Example: ./main -y 1 -p 1 -u 1 uses 1 thread for every function. More on the flags below.

If run with Python directly, main.py and download.py need to be in the same folder (they can simply be copied there if needed).

Example usage: python3 main.py -y 10 -u 17 -p 23 -o y

  • -y: number of yearThreads
  • -u: number of uploadThreads
  • -p: number of pageThreads
  • -o: optimised page scraping strategy (WARNING: this creates a lot of threads)
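
A sketch of how these flags might be declared with argparse (the help strings paraphrase the README; any defaults are assumptions, with the real ones set in main.py):

```python
import argparse

parser = argparse.ArgumentParser(description="Harvard Gazette link scraper")
parser.add_argument("-y", type=int, help="number of yearThreads")
parser.add_argument("-p", type=int, help="number of pageThreads")
parser.add_argument("-u", type=int, help="number of uploadThreads")
parser.add_argument("-o", choices=["y", "n"], default="n",
                    help="optimised page scraping: spawns a thread per page")
args = parser.parse_args()
```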

All of this is also displayed with the help of the -h flag.
