It uses BeautifulSoup4 and regular expressions to gather links, MongoDB to store them in a database, and argparse to read command-line arguments.
1. Takes an endpoint, e.g.: ""
- Gathers the years for which publications are logged.
- Follows each year link to find the number of pages that contain links to new articles.
- A semaphore signals the next thread to start task #3.
- Starts when the semaphore signals.
- Iterates through the pages and gathers links.
- When a page is finished, signals with a semaphore to start the threads for task #4.
- Optimised thread mode: starts one thread per page, minimising the time spent on I/O-bound tasks.
- Waits for the semaphore before starting.
- Starts uploading.
- Locks the shared variable to keep access thread-safe.
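The hand-off between the page-scraping and uploading steps can be sketched roughly as follows. This is a minimal illustration, not the program's actual code: `gather_links`, `run`, and all variable names are hypothetical, the scraping is faked with generated strings, and the "upload" only appends to a list instead of writing to MongoDB.

```python
import threading
from queue import Queue


def gather_links(page):
    # Hypothetical stand-in for the real BeautifulSoup/regex scraping.
    return [f"{page}/article-{i}" for i in range(3)]


def run(pages, page_threads=2, upload_threads=2):
    link_queue = Queue()                   # batches of links waiting for upload
    links_ready = threading.Semaphore(0)   # counts batches placed on the queue
    uploaded = []                          # shared variable (stands in for the DB)
    uploaded_lock = threading.Lock()       # guards `uploaded`

    def page_worker(my_pages):
        for page in my_pages:
            link_queue.put(gather_links(page))
            links_ready.release()          # signal: one page finished

    def upload_worker():
        while True:
            links_ready.acquire()          # wait for a signal from a page worker
            links = link_queue.get()
            if links is None:              # sentinel: nothing left to upload
                return
            with uploaded_lock:            # thread-safe update of shared state
                uploaded.extend(links)

    uploaders = [threading.Thread(target=upload_worker)
                 for _ in range(upload_threads)]
    # Split the pages round-robin between the page threads.
    chunks = [pages[i::page_threads] for i in range(page_threads)]
    scrapers = [threading.Thread(target=page_worker, args=(chunk,))
                for chunk in chunks]

    for thread in uploaders + scrapers:
        thread.start()
    for thread in scrapers:
        thread.join()
    # All real batches are queued now; wake each uploader once more to stop it.
    for _ in uploaders:
        link_queue.put(None)
        links_ready.release()
    for thread in uploaders:
        thread.join()
    return sorted(uploaded)


if __name__ == "__main__":
    print(len(run(["page-1", "page-2", "page-3", "page-4"])))
```

In this sketch, passing `page_threads=len(pages)` gives each page its own thread, which approximates the optimised mode described above. `Queue.get` already blocks on its own, so the explicit semaphore is kept only to mirror the signalling described in the steps.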
Average: 87.47159079138382 seconds
Median: 87.05949537330392 seconds
Average: 1448.121027455036 seconds
Median: 1449.7646775171445 seconds
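Figures like these could be collected with a small timing harness. The sketch below is an assumption about how such numbers might be produced, not the measurement code actually used: `time_runs` and `summarise` are hypothetical helpers, and the workload is a placeholder `sleep` rather than a real scraper run.

```python
import statistics
import time


def time_runs(task, repeats=5):
    """Run `task` repeatedly and return each wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        task()
        durations.append(time.perf_counter() - start)
    return durations


def summarise(durations):
    """Return (average, median) of the measured durations."""
    return statistics.mean(durations), statistics.median(durations)


if __name__ == "__main__":
    # Placeholder workload standing in for a full run of the program.
    average, median = summarise(time_runs(lambda: time.sleep(0.01)))
    print(f"Average: {average} seconds")
    print(f"Median: {median} seconds")
```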
Running the program requires the following modules:
- argparse, pip3 install argparse # parsing command-line arguments
- pymongo, pip3 install pymongo # connecting to the database
- regular expressions (re),
- bs4 (Beautiful Soup), pip3 install beautifulsoup4 # parsing HTML
- queue,
- requests,
- date,
- time,
- threading.
With pyinstaller the program was built into an executable (main). It is located inside the main zip folder or the dist folder, and has been tested and works properly, including with the flags.
Example: ./main -y 1 -p 1 -u 1
This uses 1 thread for every function. More below:
If run with Python, the source files need to be in the same folder, or can simply be copy-pasted if needed.
Example usage: python3 -y 10 -u 17 -p 23 -o y
- -y: amount of yearThreads
- -u: amount of uploadThreads
- -p: amount of pageThreads
- -o: optimised page scraping strategy (WARNING: this creates a lot of threads)
All of this is displayed with the -h flag.
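For reference, the flags could be declared with argparse along these lines. This is only a sketch: the `dest` names, the defaults of 1 and "n", and the y/n choices for -o are assumptions made for illustration, and the program's real parser may differ.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(
        description="Threaded link scraper (illustrative sketch)")
    # Defaults and destination names below are assumptions, not the real ones.
    parser.add_argument("-y", dest="year_threads", type=int, default=1,
                        help="amount of yearThreads")
    parser.add_argument("-u", dest="upload_threads", type=int, default=1,
                        help="amount of uploadThreads")
    parser.add_argument("-p", dest="page_threads", type=int, default=1,
                        help="amount of pageThreads")
    parser.add_argument("-o", dest="optimised", choices=["y", "n"], default="n",
                        help="optimised page scraping strategy "
                             "(WARNING: creates a lot of threads)")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

argparse adds the -h flag automatically and builds its output from the `help` strings, which is why the flag descriptions appear there.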