Scrape plant scientific name information from the internet.
Current supported sources:
Requirements:
- Python >= 3.10 (you can use pyenv for easier Python version management)
- pipenv
Detailed Guide for Windows
- Download Python from https://www.python.org/downloads/
- Install Python by following the installer's instructions
- Press the Windows key (the key with the Windows logo), search for "env", then open "Edit the system environment variables"
- Click "Environment Variables"
- In the "System Variables" section, edit the `Path` key
- Add these paths using the "New" button. Replace `<YOUR_USERNAME>` with your Windows username (you can see it in the `C:\Users` folder) and `Python310` with your installed Python version:
  - `C:\Users\<YOUR_USERNAME>\AppData\Local\Programs\Python\Python310`
  - `C:\Users\<YOUR_USERNAME>\AppData\Local\Programs\Python\Python310\Scripts`
  - `C:\Users\<YOUR_USERNAME>\AppData\Roaming\Python\Python310\Scripts`
- Click OK, then OK
- Open cmd, then type `python --version`. It should respond with the Python version.
- Type `pip3 install --user pipenv` to install pipenv, and make sure the installation succeeds.
- Type `pipenv --version`. It should respond with the pipenv version.
- Done! You can continue with the guide in the "How to run" section.
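The checks in the steps above can also be done from Python. The following is a minimal sketch, not part of this project: the function name `check_environment` and its parameters are hypothetical, and it only verifies the two stated requirements (Python >= 3.10 and a `pipenv` executable on `PATH`).

```python
import shutil
import sys

MIN_PYTHON = (3, 10)  # minimum version stated in the requirements


def check_environment(version_info=sys.version_info, which=shutil.which):
    """Return a list of human-readable problems; an empty list means all good.

    `version_info` and `which` are parameters only so the check can be
    exercised without changing the real interpreter or PATH.
    """
    problems = []
    if tuple(version_info[:2]) < MIN_PYTHON:
        problems.append(
            f"Python {version_info[0]}.{version_info[1]} found, "
            f"but >= {MIN_PYTHON[0]}.{MIN_PYTHON[1]} is required"
        )
    if which("pipenv") is None:
        problems.append("pipenv was not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("Problem:", problem)
```

If the script prints nothing, both requirements are likely met. If it reports that pipenv is missing on Windows, the `Scripts` paths added to `Path` are the first thing to re-check.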
How to run
- Clone: `git clone [email protected]:rizqirizqi/scientific-name-scraper.git`, then `cd scientific-name-scraper`
- Install dependencies: `pipenv --python 3`, then `pipenv install`
- Fill in your input in `input.csv`. See `samples/input.csv` for an example. You can also use txt or xlsx if you want.
- Run: `pipenv run python -m sciscraper -i input.csv`
- The result will be placed in a file named `result.*.csv`
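Because the result file name contains a varying middle part (`result.*.csv`), it can be convenient to locate and read the output with a small script. This is a rough sketch under stated assumptions: the helper `read_results` is hypothetical, the files are assumed to be UTF-8, and no particular column layout is assumed, so rows are returned as plain lists of strings.

```python
import csv
from pathlib import Path


def read_results(directory="."):
    """Yield (file name, row) pairs from every result.*.csv file in `directory`."""
    for path in sorted(Path(directory).glob("result.*.csv")):
        # Assumption: the result files are UTF-8 encoded CSV.
        with path.open(newline="", encoding="utf-8") as handle:
            for row in csv.reader(handle):
                yield path.name, row


if __name__ == "__main__":
    for file_name, row in read_results():
        print(file_name, row)
```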
pipenv run python -m sciscraper --help
pipenv run scrapy shell <URL>
# Switchboard Example
pipenv run scrapy shell 'http://apps.worldagroforestry.org/products/switchboard/index.php/species_search/Acacia%20abyssinica'
# WFO Example
pipenv run scrapy shell 'http://www.worldfloraonline.org/search?query=Costus+speciosus&view=&limit=5&start=0&sort='
result = response.css("#v results > table tr")[0]
data_col = result.css("td:nth-child(2)")
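The two example URLs above encode the species name differently: Switchboard takes it as a path segment (spaces become `%20`), while WFO takes it as a query parameter (spaces become `+`). The sketch below shows one way to build such URLs for `scrapy shell` with the standard library. The function names are hypothetical and this is not necessarily how the project builds its requests; the base URLs and query parameters are copied from the examples.

```python
from urllib.parse import quote, urlencode

SWITCHBOARD_BASE = (
    "http://apps.worldagroforestry.org/products/switchboard"
    "/index.php/species_search/"
)
WFO_BASE = "http://www.worldfloraonline.org/search"


def switchboard_url(species):
    # The species name is a path segment, so spaces are percent-encoded.
    return SWITCHBOARD_BASE + quote(species)


def wfo_url(species, limit=5, start=0):
    # The species name is a query parameter, so spaces are encoded as "+".
    params = {"query": species, "view": "", "limit": limit, "start": start, "sort": ""}
    return WFO_BASE + "?" + urlencode(params)


if __name__ == "__main__":
    print(switchboard_url("Acacia abyssinica"))
    print(wfo_url("Costus speciosus"))
```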
rm result.* && rm log.*
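On systems without `rm` (for example, plain Windows cmd), the same cleanup can be approximated in Python. This is a minimal sketch with a hypothetical helper name; unlike the shell command, it does not fail when no matching file exists.

```python
from pathlib import Path


def clean_outputs(directory="."):
    """Delete result.* and log.* files in `directory`; return the removed names."""
    removed = []
    for pattern in ("result.*", "log.*"):
        for path in Path(directory).glob(pattern):
            if path.is_file():  # leave directories untouched
                path.unlink()
                removed.append(path.name)
    return sorted(removed)


if __name__ == "__main__":
    print("Removed:", clean_outputs())
```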
Case | Link | Note |
---|---|---|
ICRAF Database Not Found | Engelhardia spicata | Need human to check ✔ |
Genus Found | Forficula | Need human to check ✔ |
Multiple Species Found | Alstonia spectabilis | Get the matched substring of the species ✔ |
Similar Species Found | Costus speciosus | Need human to check ✔ |
Similar Species Found: variant | Engelhardtia spicata | Get the exact match ✔ |
Similar Species Found: subsp / ssp | Ailanthus integrifolia | Get the species ✔ |
Similar Species Found: double space | Anacardium occidentale | Get the exact match ✔ |
Duplicate Link Found | Intsia bijuga | Need human to check ✔ |
External Link Found | Elaeocarpus petiolatus | Remove the link ✔ |
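Two of the cases in the table ("double space" and "subsp / ssp") come down to normalizing names before comparing them. The following sketch illustrates that idea only. The function names are hypothetical, and the scraper's actual matching logic may differ.

```python
import re


def normalize_name(name):
    """Collapse runs of whitespace into single spaces and trim the ends."""
    return re.sub(r"\s+", " ", name).strip()


def species_part(name):
    """Drop a 'subsp.'/'ssp.' marker and everything after it, keeping the species."""
    words = normalize_name(name).split(" ")
    for index, word in enumerate(words):
        if word.lower().rstrip(".") in {"subsp", "ssp"}:
            return " ".join(words[:index])
    return " ".join(words)


def is_exact_match(query, candidate):
    """Compare two names case-insensitively, ignoring extra whitespace."""
    return normalize_name(query).lower() == normalize_name(candidate).lower()
```

With these helpers, a search result containing a double space would still count as an exact match for the queried name, and a subspecies entry would be reduced to its species before comparison.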
- Fork this repo
- Develop
- Create pull request
- Tag @rizqirizqi for review
- Merge
MIT