harvardGazette-webscraper

A university-website scraper and parallel-programming project that gathers all news links from the Harvard Gazette news site in a synchronised, threadsafe manner.

A presentation (about a 5-minute read) can be found in the repository.

It uses BeautifulSoup4 and regular expressions to gather links, MongoDB to store them, and argparse to handle command-line arguments.

1. Takes an endpoint, e.g. "https://news.harvard.edu/gazette/story/2022/"

- Uses only 1 thread, as this is only 1 request, so a single thread is already peak performance.
- Gathers the years of published articles (see the first sketch after this list).

2. Starts the specified number of link threads (the default is the number of publication years)

- These threads gather the number of pages that contain links to news articles.
- A semaphore synchronises the next thread to start on task #3 (see the semaphore sketch after this list).

3. Starts the link-gathering threads (the default number of threads is the same as in task #2)

- Waits for the semaphore signal to start.
- Iterates through the pages and gathers links.
- When a page is finished, signals with semaphores for the task #4 threads to start.
- Optimised thread mode: starts a thread for each page, minimising the time spent on IO-bound work (also covered in the semaphore sketch after this list).

4. Starts the link-uploading threads to the MongoDB server

- Waits for the semaphore to start.
- Starts uploading.
- Locks the shared variable to keep it threadsafe (see the upload sketch after this list).
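
The steps above are easiest to see in code. A minimal sketch of step 1, assuming the year archives are linked as /gazette/story/YYYY/ URLs (the exact HTML structure is an assumption; the real parsing lives in main.py and download.py):

```python
import re
import requests
from bs4 import BeautifulSoup

YEAR_RE = re.compile(r"/gazette/story/(\d{4})/")

def gather_years(endpoint):
    """Step 1: one request on one thread; collect the publication years."""
    html = requests.get(endpoint, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    # Assumed pattern: each year archive is linked as .../gazette/story/YYYY/
    return sorted({YEAR_RE.search(a["href"]).group(1)
                   for a in soup.find_all("a", href=YEAR_RE)})
```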
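
Steps 2 and 3 hinge on a semaphore handoff and, in optimised mode, one thread per page. A sketch of that pattern, where count_pages and extract_links are hypothetical stand-ins for the repository's parsing code:

```python
import queue
import threading

pages_ready = threading.Semaphore(0)   # released once per counted year (step 2)
upload_ready = threading.Semaphore(0)  # released once per scraped page (step 3)
link_queue = queue.Queue()             # threadsafe hand-off to the uploaders
page_counts = {}
counts_lock = threading.Lock()

def year_worker(year):
    """Step 2: find how many listing pages a year has, then signal step 3."""
    n = count_pages(year)                      # hypothetical pagination helper
    with counts_lock:
        page_counts[year] = n
    pages_ready.release()

def scrape_page(year, page_no):
    """One listing page: collect its article links, then signal step 4."""
    for link in extract_links(year, page_no):  # hypothetical parsing helper
        link_queue.put(link)
    upload_ready.release()

def link_worker():
    """Step 3, optimised mode: wait for a counted year, then fan out per page."""
    pages_ready.acquire()                      # blocks until step 2 signals
    with counts_lock:
        year, n = page_counts.popitem()        # take any finished year
    threads = [threading.Thread(target=scrape_page, args=(year, p))
               for p in range(1, n + 1)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```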
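
Step 4 then reduces to a consumer loop. A sketch against a local MongoDB instance, reusing link_queue and upload_ready from the sketch above (the connection string and the gazette.links collection name are assumptions):

```python
import threading
from pymongo import MongoClient

collection = MongoClient("mongodb://localhost:27017")["gazette"]["links"]
uploaded = []                     # shared record of what has been uploaded
uploaded_lock = threading.Lock()  # the lock that keeps the shared list threadsafe

def upload_worker():
    """Step 4: wait for a finished page, then push its links to MongoDB."""
    while True:
        upload_ready.acquire()                 # blocks until step 3 signals
        batch = []
        while not link_queue.empty():          # drain the shared queue
            batch.append(link_queue.get())
        if batch:
            collection.insert_many([{"url": u} for u in batch])
            with uploaded_lock:                # guard the shared variable
                uploaded.extend(batch)
```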

Time with the optimised threading algorithm:

Average: 87.47 seconds
Median: 87.06 seconds

Time without multithreading:

Average: 1448.12 seconds
Median: 1449.76 seconds

Requirements

Running the program requires the following modules:

  • argparse, pip3 install argparse  # parsing command-line arguments
  • pymongo, pip3 install pymongo  # connecting to the database
  • re (regular expressions)
  • bs4 (BeautifulSoup), pip3 install beautifulsoup4  # parsing HTML
  • queue
  • requests
  • date
  • time
  • threading
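
Of these, only pymongo, beautifulsoup4, and requests are third-party packages on Python 3; the remaining modules ship with the standard library (argparse's pip line above is optional on modern Python). A single command covers the install:

```
pip3 install pymongo beautifulsoup4 requests
```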

Usage

main.py was made into an executable (main) with PyInstaller. It is located inside the main zip folder or the dist folder, and has been tested to work properly with the flags as well.

Example: ./main -y 1 -p 1 -u 1 uses 1 thread for every function. More on the flags below.

If run with Python directly, main.py and download.py need to be in the same folder (they can simply be copied there if needed).

Example usage: python3 main.py -y 10 -u 17 -p 23 -o y

  • -y: number of yearThreads
  • -u: number of uploadThreads
  • -p: number of pageThreads
  • -o: optimised page scraping strategy (WARNING: this creates a lot of threads)
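
A sketch of how these flags might be declared with argparse (the help strings paraphrase the README; any defaults are assumptions, with the real ones set in main.py):

```python
import argparse

parser = argparse.ArgumentParser(description="Harvard Gazette link scraper")
parser.add_argument("-y", type=int, help="number of yearThreads")
parser.add_argument("-p", type=int, help="number of pageThreads")
parser.add_argument("-u", type=int, help="number of uploadThreads")
parser.add_argument("-o", choices=["y", "n"], default="n",
                    help="optimised page scraping: spawns a thread per page")
args = parser.parse_args()
```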

All of this is also displayed with the help of the -h flag.
