forked from microsoft/CameraTraps
Merge branch 'main' of github.com:microsoft/CameraTraps into main
Showing 16 changed files with 1,693 additions and 227 deletions.
## Mapping labels to a standard taxonomy (usually for new LILA datasets)

When a new .json file comes in and needs to be mapped to scientific names...
* Assuming this is a LILA dataset, edit the [LILA metadata file](http://lila.science/wp-content/uploads/2020/03/lila_sas_urls.txt) to include the new .json and dataset name.
* Assuming this is a LILA dataset, use get_lila_category_list.py to download the .json files for every LILA dataset. This will produce a .json-formatted dictionary mapping each dataset to all of the categories it contains.
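The resulting .json file is a dictionary mapping each dataset name to its list of categories. A minimal sketch of the expected shape (the category entries and the `count` field are illustrative assumptions, based on the "contains counts for each category" comment in the mapping script included below):

```python
import json

# Hypothetical excerpt of lila_dataset_to_categories.json; names and counts
# are invented for illustration
lila_dataset_to_categories = {
    "Caltech Camera Traps": [
        {"name": "deer", "count": 100},
        {"name": "empty", "count": 250},
    ]
}

# Round-trip through JSON, as the downstream scripts do when loading the file
serialized = json.dumps(lila_dataset_to_categories)
reloaded = json.loads(serialized)

for dataset_name, categories in reloaded.items():
    print(dataset_name, [c["name"] for c in categories])
```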
* Use map_new_lila_datasets.py to create a .csv file mapping each of those categories to a scientific name and taxonomy. This will eventually become a subset of rows in the "master" .csv file. This is a semi-automated process; it will look up common names against the iNat and GBIF taxonomies, with some heuristics to avoid simple problems (like making sure that "greater_kudu" matches "greater kudu", or that "black backed jackal" matches "black-backed jackal"), but you will need to fill in a few gaps manually. I do this with three windows open: a .csv editor, Spyder (with the cell called "manual lookup" from this script open), and a browser. Once you generate this .csv file, it's considered permanent, i.e., the cell that wrote it won't re-write it, so manually edit to your heart's content.
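The kind of normalization heuristic described here can be sketched as a small function (a simplified illustration of the idea, not the script's actual code):

```python
def normalize_category_name(name):
    # Collapse case, underscores, hyphens, and extra whitespace so that
    # variants like 'greater_kudu' / 'greater kudu' or 'black backed jackal' /
    # 'black-backed jackal' compare equal before the taxonomy lookup.
    name = name.lower().replace('_', ' ').replace('-', ' ')
    return ' '.join(name.split())

assert normalize_category_name('greater_kudu') == normalize_category_name('greater kudu')
assert normalize_category_name('black backed jackal') == normalize_category_name('black-backed jackal')
```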
* Use preview_lila_taxonomy.py to produce an HTML file full of images that you can use to make sure that the matches were sensible; be particularly suspicious of anything that doesn't look like a mammal, bird, or reptile. Go back and fix things in the .csv file. This script/notebook also does a bunch of other consistency checking.
* When you are totally satisfied with that .csv file, manually append it to the "master" .csv file (lila-taxonomy-mapping.csv), which is currently in a private repository. preview_lila_taxonomy.py can also be run against the master file.
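The append itself is a manual copy-and-paste, but conceptually it's just a row-wise concatenation. A sketch with pandas (the column names match the mapping script included below; the rows themselves are invented for illustration):

```python
import pandas as pd

# Invented rows standing in for the master file and the new dataset's file
master_df = pd.DataFrame([
    {'dataset_name': 'some-existing-dataset', 'query': 'deer',
     'scientific_name': 'odocoileus virginianus'},
])
new_df = pd.DataFrame([
    {'dataset_name': 'some-new-dataset', 'query': 'wtd',
     'scientific_name': 'odocoileus virginianus'},
])

combined_df = pd.concat([master_df, new_df], ignore_index=True)
# combined_df.to_csv('lila-taxonomy-mapping.csv', index=False)  # the private "master" file
print(len(combined_df))
```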
* Check for errors one more time (this should be redundant with what's now included in preview_lila_taxonomy.py, but it can't hurt) by running:
```bash
python taxonomy_mapping/taxonomy_csv_checker.py /path/to/taxonomy.csv
```
* Prepare the "release" taxonomy file (which removes a couple of columns and drops unused rows) using prepare_lila_taxonomy_release.py.
* Use map_lila_categories.py to get a mapping of every LILA dataset to the common taxonomy.
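The key format used throughout these scripts is a lowercased `dataset_name:category_name` string; this matches the mapping script included below. A small sketch of the lookup (the example dictionary entry is invented; real entries come from the release taxonomy .csv):

```python
# Invented example entry; real keys are built from the taxonomy .csv file
ds_query_to_scientific_name = {
    'caltech camera traps:deer': 'odocoileus hemionus',
}

def lookup_scientific_name(dataset_name, category_name):
    # Mirror the key construction used in the mapping script below
    ds_query = (dataset_name + ':' + category_name).lower()
    return ds_query_to_scientific_name.get(ds_query, 'unmapped')

print(lookup_scientific_name('Caltech Camera Traps', 'deer'))
print(lookup_scientific_name('Caltech Camera Traps', 'car'))
```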
## Visualize the Taxonomy Hierarchy
* The `visualize_taxonomy.ipynb` notebook demonstrates how to visualize the taxonomy hierarchy. It requires the *networkx* and *graphviz* Python packages.
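For readers without the notebook handy: the underlying idea is just a parent-to-child graph over taxa. A dependency-free sketch of the same structure (the notebook itself uses *networkx* and *graphviz* for proper rendering; the taxa below are a tiny illustrative subset):

```python
# Parent -> child relationships for a few taxa, as (parent, child) pairs
edges = [
    ('mammalia', 'cetartiodactyla'),
    ('mammalia', 'carnivora'),
    ('cetartiodactyla', 'cervidae'),
    ('cervidae', 'odocoileus virginianus'),
]

children = {}
for parent, child in edges:
    children.setdefault(parent, []).append(child)

def print_tree(node, depth=0):
    # Depth-first traversal, printed with two-space indentation per level
    print('  ' * depth + node)
    for c in children.get(node, []):
        print_tree(c, depth + 1)

print_tree('mammalia')
```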
```python
#
# Using the taxonomy .csv file, map all LILA datasets to the standard taxonomy
#

#%% Constants and imports

import json
import os

import pandas as pd

# Created by get_lila_category_list.py... contains counts for each category
lila_dataset_to_categories_file = r"G:\temp\lila\lila_categories_list\lila_dataset_to_categories.json"
lila_taxonomy_file = r"G:\temp\lila\lila-taxonomy-mapping_release.22.07.03.1608.csv"

assert os.path.isfile(lila_dataset_to_categories_file)
assert os.path.isfile(lila_taxonomy_file)


#%% Load category and taxonomy files

with open(lila_dataset_to_categories_file, 'r') as f:
    lila_dataset_to_categories = json.load(f)

taxonomy_df = pd.read_csv(lila_taxonomy_file)


#%% Map dataset names and category names to scientific names

ds_query_to_scientific_name = {}

unmapped_queries = set()

# i_row = 1; row = taxonomy_df.iloc[i_row]; row
for i_row, row in taxonomy_df.iterrows():

    ds_query = row['dataset_name'] + ':' + row['query']
    ds_query = ds_query.lower()

    if not isinstance(row['scientific_name'], str):
        unmapped_queries.add(ds_query)
        ds_query_to_scientific_name[ds_query] = 'unmapped'
        continue

    ds_query_to_scientific_name[ds_query] = row['scientific_name']


#%% For each dataset, make sure we can map every category to the taxonomy

# dataset_name = list(lila_dataset_to_categories.keys())[0]
for _dataset_name in lila_dataset_to_categories.keys():

    # Bounding-box versions of a dataset share categories with the base dataset
    if '_bbox' in _dataset_name:
        dataset_name = _dataset_name.replace('_bbox', '')
    else:
        dataset_name = _dataset_name

    categories = lila_dataset_to_categories[dataset_name]

    # c = categories[0]
    for c in categories:
        ds_query = dataset_name + ':' + c['name']
        ds_query = ds_query.lower()

        if ds_query not in ds_query_to_scientific_name:
            print('Could not find mapping for {}'.format(ds_query))
        else:
            scientific_name = ds_query_to_scientific_name[ds_query]
```