This web scraper uses Scrapy with Python to scrape all COVID-19 related updates posted on select government websites (Selenium was used to scrape the links for New Zealand's website).
This scraper uses Scrapy, CLD-2, dateparser, and html2text as dependencies. Python 3 is also used to create a virtual environment, so the scraper runs in an isolated environment.
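How these libraries are wired into the spiders isn't shown here, but a minimal sketch of how they are typically combined (assuming the CLD-2 binding is the `pycld2` package and a hypothetical `process_post` helper) might look like this:

```python
import dateparser        # parses human-readable dates into datetime objects
import html2text         # converts scraped HTML into plain text
import pycld2 as cld2    # CLD-2 bindings, used here for language detection

def process_post(html, raw_date):
    """Hypothetical helper: turn raw HTML and a date string into post fields."""
    text = html2text.html2text(html)          # e.g. "<p>Stay home</p>" -> "Stay home\n\n"
    published = dateparser.parse(raw_date)    # e.g. "12 March 2020" -> datetime(2020, 3, 12)
    _, _, details = cld2.detect(text)         # details[0] describes the most likely language
    language = details[0][0].lower()          # e.g. "english"
    return {"text": text, "published": published, "language": language}
```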
- Create a virtualenv and activate it. (This is slightly different for Windows vs Linux/Mac.)
- In order for Selenium to work, you need to download the Chrome webdriver (`chromedriver`) and place it into the directory containing the shell scripts. Choose the version that matches the web browser you have. Note: you can always opt to use a different browser like Firefox; just make sure to change the code in `new_zealand_links.py` accordingly. Also, if you are not on a Windows machine, you need to change this line in `new_zealand_links.py` to point at your driver binary (see the sketch after this list): `CHROMEDRIVER_PATH = './chromedriver.exe'`
- Run `pip install -r requirements.txt` from inside the directory containing the `requirements.txt` file, with the virtualenv active, to install all the dependencies.
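The `CHROMEDRIVER_PATH` constant mentioned above is consumed wherever the Selenium driver is created. The snippet below is only a hedged sketch of that wiring (Selenium 3 style; the headless option and target URL are assumptions), not the actual contents of `new_zealand_links.py`:

```python
from selenium import webdriver

# On Windows the driver binary ends in .exe; on Linux/Mac drop the extension.
CHROMEDRIVER_PATH = './chromedriver.exe'   # e.g. './chromedriver' on Linux/Mac

options = webdriver.ChromeOptions()
options.add_argument('--headless')   # assumption: no browser window needed for link scraping

# Selenium 3 style: the driver path is passed directly to the Chrome driver.
driver = webdriver.Chrome(executable_path=CHROMEDRIVER_PATH, options=options)
driver.get('https://covid19.govt.nz/')   # assumption: New Zealand's COVID-19 site
links = [a.get_attribute('href') for a in driver.find_elements_by_tag_name('a')]
driver.quit()
```

Swapping in Firefox would mean downloading `geckodriver` instead and constructing `webdriver.Firefox(...)` with its path.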
While inside the virtualenv, `cd` into the directory that contains `powershell_script.ps1` and run `.\powershell_script.ps1` from a PowerShell terminal, passing one of the allowed arguments. For example, running `.\powershell_script.ps1 cdc` will fetch COVID-19 related posts from the CDC website. The list of allowed options can be found at the bottom of this document.
While inside the virtualenv, `cd` into the directory that contains `unix_script.sh` and run `bash unix_script.sh` from a shell terminal, passing one of the allowed arguments. For example, running `bash unix_script.sh cdc` will fetch COVID-19 related posts from the CDC website. The list of allowed options can be found at the bottom of this document.
The scraped posts are saved in the `posts` directory in the format `{title,source,published,url,scraped,classes,country,municipality,language,text}` for each post. The links to each update are saved in the `links` directory.
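The field list above maps naturally onto a Scrapy item. The class below is only a hedged sketch of what such an item could look like; the real project may define these fields differently (for example, as plain dicts):

```python
import scrapy

class PostItem(scrapy.Item):
    """Hypothetical item mirroring the fields listed above."""
    title = scrapy.Field()         # headline of the update
    source = scrapy.Field()        # agency or site the post came from
    published = scrapy.Field()     # publication date, parsed with dateparser
    url = scrapy.Field()           # link to the original post
    scraped = scrapy.Field()       # timestamp of when the post was scraped
    classes = scrapy.Field()       # classification tags for the post
    country = scrapy.Field()       # country the update applies to
    municipality = scrapy.Field()  # sub-national region, if any
    language = scrapy.Field()      # language detected with CLD-2
    text = scrapy.Field()          # body of the post, converted with html2text
```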
- All (Run all scrapers)
Note: Since all the passed arguments are converted to lowercase, casing doesn't matter when passing them in the shell. For example, `.\powershell_script.ps1 cDc` works the same way as `.\powershell_script.ps1 CDC`.
- Since additions to `posts` are appended rather than overwritten, all the contents of (or the whole) `posts` directory must be deleted before each run (except the first run, since the `posts` directory does not exist yet during the first run). If this step is not taken, `posts` WILL HAVE incorrect data.
- DO NOT delete the files in the `links` directory, even though it is safe to delete the contents of the files themselves.
- Since the log level has been set to `INFO`, only informational messages will be displayed during runs (see the settings sketch at the end of these notes). If an error is encountered and the link being scraped has `downloads` or `.pdf` somewhere in it, the error message can be ignored. There may also occasionally be a `404` response and `dateparser` errors, which should be ignored on a case-by-case basis.
- While in the virtualenv, run `deactivate` to stop and exit the virtual environment.
- Source code for the scrapers can be found in the `spiders` directory.
- `new_zealand_links.py` is located in a separate directory called `new_zealand_links` in the root directory because that scraper uses `selenium`. The reason for not putting the file with all the other scrapers inside the `spiders` directory is that Scrapy pre-compiles (checks) the Python scripts in that directory every time before you call `scrapy`. This means that if `new_zealand_links.py` were placed inside the `spiders` directory, the file would be run every time `scrapy` is called from the shell. For example, if you run `scrapy crawl cdc_links`, `new_zealand_links.py` will still be run before the cdc_links scraper is run. This is especially problematic if you use the script to run all scrapers (the settings sketch below shows the spider-discovery setting involved). This change is also reflected in the scripts.
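Both of the notes above, the `INFO` log level and Scrapy pre-compiling everything in `spiders`, come down to standard Scrapy project settings. The snippet below is a hedged sketch of the relevant `settings.py` entries, not the project's actual settings file; the package name is a placeholder:

```python
# settings.py (sketch)

# Restrict console output to INFO and above, so DEBUG noise from every
# request/response is hidden during runs.
LOG_LEVEL = 'INFO'

# Scrapy imports every module under these packages at startup (the
# "pre-compile" behaviour described above), which is why the Selenium-based
# new_zealand_links.py is kept outside of the spiders package.
SPIDER_MODULES = ['covid_scraper.spiders']   # placeholder package name
NEWSPIDER_MODULE = 'covid_scraper.spiders'
```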