datecollected vs eventDate #1
Comments
I also wanted to let you know that a record query (rq) on scientificname returns exact matches only. An exact-match search can miss records, which may or may not matter for your research. Here is an example of three ways to search iDigBio by scientific name that get around the exact-match limitation: https://gist.github.com/mgaynor1/8231646b6614c29d384a387ce6731e7b
I saw your comment on the gist and realized the above code was incorrect! This worked for me:
Hi Shelly, THANK YOU SO MUCH for pointing this out! Sorry for the delay in responding; I tried to get back to you before a meeting yesterday, and then... I have updated that function to 1) return only the desired fields (thanks!), and 2) assemble the known collection date from year, month, and day.

Regarding the join to the media results: usually I can find a field to join on pretty quickly, but it took me a decent amount of time to realize that 'result' was the field to use. Could this relationship be added to the documentation? 99% of the reason I started using ridigbio is that the URLs to the scanned sheets let me easily see what's going on, which obviously has many uses beyond scoring.

Regarding the sheets: some URLs default to downloading the image, and some URLs don't want to open at all (unless you remove this). I cannot find any rhyme or reason for the default downloads, but I can pull up some examples of the behavior if you are interested. I didn't think it was worth posting about over there...
Thank you for figuring out the correct field to use! Sadly, I am not in charge of documentation, but I will open an issue on the ridigbio package so that this information can hopefully be added. By 'result' do you mean 'record'? Your code still includes the inner_join where 'uuid' = 'record'.

As for the URLs that download: I believe iDigBio stopped accepting images in 12/2020, so some URLs may point to images stored by iDigBio and others may link to the data providers (aka the collections). This is just a guess, so please feel free to send me examples of media records whose URLs go straight to download or do not open. I can check with the iDigBio team to see whether this theory holds for the direct downloads. For the URLs that do not work at all, definitely send those along.
Yes, I meant 'record'; I haven't seen the sun in a few days and it's starting to take its toll.
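For anyone else hunting for the join key, here is a minimal sketch of the relationship discussed above, written in Python for illustration rather than with ridigbio. It assumes the specimen records and the media results have already been downloaded into lists of dictionaries, and that each media row carries the specimen's UUID in a 'record' field, as described in this thread. The helper name join_media_to_records is hypothetical.

```python
def join_media_to_records(records, media):
    """Inner-join media rows to specimen records where media 'record' == record 'uuid'.

    Assumption: both inputs are lists of dicts keyed by the field names
    mentioned in this thread ('uuid', 'record', 'accessuri').
    """
    # Index the specimen records by UUID for constant-time lookup.
    by_uuid = {rec["uuid"]: rec for rec in records}

    joined = []
    for item in media:
        rec = by_uuid.get(item.get("record"))
        if rec is None:
            continue  # inner join: skip media with no matching specimen record
        # One output row per media item, so a specimen with two images appears twice.
        joined.append({**rec, "accessuri": item.get("accessuri")})
    return joined
```

Like an inner join, this drops specimen records that have no media, so compare row counts before and after if the missing sheets matter.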
Currently, ridigbio returns datecollected by default, a field we do not recommend using in scientific research. When a data provider does not supply a full date in the Darwin Core eventDate field, either the whole value or its missing parts (i.e., month and/or day) are randomly generated and thus may lack any real meaning. The generated dates are difficult to detect because they are randomly distributed. We are currently working to modify our ingestion pipeline to avoid randomly generating dates. However, dates remain an issue across biodiversity aggregators and the solution is not clear (see GBIF for example).
Why does this matter for SeedPhenology?
I found that this repository uses datecollected as if it were a real value. The value is then transformed to day of year (doy), which may be misleading, and once transformed, artificial dates may be difficult to detect.
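One way to audit records that were already downloaded is to check whether the provider actually supplied a full date, and to distrust datecollected when it did not. The following is only a sketch in Python for illustration, not part of ridigbio. It assumes each record is a dictionary keyed by the raw Darwin Core field names, and that a complete eventDate starts with an ISO 8601 YYYY-MM-DD prefix, which will not hold for every provider. The function name provider_gave_full_date is hypothetical.

```python
import re

# Assumption: a complete provider date begins with YYYY-MM-DD (ISO 8601).
FULL_ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}")


def provider_gave_full_date(record):
    """Return True when the provider's own fields contain a full year, month, and day."""
    event_date = str(record.get("data.dwc:eventDate") or "").strip()
    if FULL_ISO_DATE.match(event_date):
        return True

    # Otherwise fall back to the separate text fields; all three must be non-blank.
    parts = (
        record.get("data.dwc:year"),
        record.get("data.dwc:month"),
        record.get("data.dwc:day"),
    )
    return all(str(part or "").strip() for part in parts)
```

Records for which this returns False are the ones whose datecollected, and any doy derived from it, may include generated components.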
What to do instead?
We plan to update the ridigbio package to instead return "data.dwc:eventDate", "data.dwc:year", "data.dwc:month", and "data.dwc:day", which are all text fields rather than dates. These fields are not randomly generated; the values come directly from data providers, so they may carry meaning in biological research. See current issue and pull request.
From your code, I believe you all want scientificname, lat/lon, collector, UUID, and date. To obtain these fields, this is how you need to modify the download:
For the media pull, I think you want just the accessuri and the UUID:
Your function script will also need additional modification, since the dates downloaded here will not be in date format; instead, all dates will be text strings. There are many ways to convert these to dates; for example, see the gatoRs remove_duplicate function or the proposed ridigbio solution here.
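To illustrate the kind of conversion involved, here is a minimal sketch in Python; it is not the gatoRs or ridigbio approach. It assumes the year, month, and day arrive as text (possibly blank or missing) and deliberately returns nothing for partial or invalid dates instead of guessing. The function names to_int, collection_date, and day_of_year are hypothetical.

```python
from datetime import date


def to_int(text):
    """Convert a text field to an integer, or None when it is blank or not a whole number."""
    try:
        return int(str(text).strip())
    except (TypeError, ValueError):
        return None


def collection_date(year, month, day):
    """Build a date only when the provider supplied a complete, valid year/month/day."""
    y, m, d = to_int(year), to_int(month), to_int(day)
    if y is None or m is None or d is None:
        return None  # partial date: keep it missing rather than invent the rest
    try:
        return date(y, m, d)
    except ValueError:
        return None  # impossible date, such as month 13 or 30 February


def day_of_year(year, month, day):
    """Day of year (doy) for a complete date, otherwise None."""
    when = collection_date(year, month, day)
    return when.timetuple().tm_yday if when else None
```

Keeping partial dates as missing means the doy analysis only uses dates the collector actually recorded; year-only or year-month records can still be analyzed separately at coarser resolution.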
Hope this helps and please let me know if you have any questions or want more specific code suggestions.