datecollected vs eventDate #1

mgaynor1 · 2024-01-17T17:10:36Z

Currently, ridigbio returns datecollected by default, a field we do not recommend for scientific research. When a data provider does not supply a full date in the Darwin Core eventDate field, the whole value or its missing parts (i.e., month and/or day) are randomly generated and thus may lack any real meaning. The generated dates are difficult to detect because they are randomly distributed. We are currently working to modify our ingestion pipeline to avoid randomly generating dates. However, dates remain an issue across biodiversity aggregators and the solution is not clear (see GBIF for example).

Why does this matter for SeedPhenology?
I found that this repository uses datecollected as if it were a real value. The value is then transformed to day of year (doy), which may be misleading, and artificial dates are difficult to detect once they have been transformed.
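One way to see which records are affected is to compare datecollected against the provider-supplied eventDate text. The sketch below uses a small mock data frame (the uuids and values are hypothetical, not real iDigBio records) and assumes that a full provider date starts with an ISO-style YYYY-MM-DD string. Real eventDate values can take other forms, such as ranges or year-only text, so treat the flag as a starting point rather than a definitive test.

```r
# Mock records: hypothetical values standing in for an iDigBio download
records <- data.frame(
  uuid = c("a", "b", "c"),
  datecollected = c("1998-05-12T00:00:00+00:00",
                    "2004-07-19T00:00:00+00:00",
                    "1931-02-03T00:00:00+00:00"),
  `data.dwc:eventDate` = c("1998-05-12", "2004", NA),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# TRUE where the provider text begins with a full YYYY-MM-DD date;
# grepl() returns FALSE for NA, so missing eventDate is flagged too
has_full_date <- grepl("^[0-9]{4}-[0-9]{2}-[0-9]{2}",
                       records[["data.dwc:eventDate"]])

# A day of year derived from datecollected is suspect for these rows
records$date_suspect <- !has_full_date
```

Rows flagged as suspect still have a well-formed datecollected value, which is why the generated dates are hard to spot from that field alone.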

What to do instead?
We plan to update the ridigbio package to instead return "data.dwc:eventDate", "data.dwc:year", "data.dwc:month", and "data.dwc:day", which are all text fields rather than dates. These fields are not randomly generated; the values come directly from data providers, so they can carry real meaning in biological research. See current issue and pull request.

From your code, I believe you all want scientificname, lat/lon, collector, UUID, and date. To obtain these fields, this is how you need to modify the download:

fields2getrecord <- c("data.dwc:scientificName",
                      "data.dwc:decimalLatitude",
                      "data.dwc:decimalLongitude",
                      "collector",
                      "uuid",
                      "data.dwc:eventDate",
                      "data.dwc:year",
                      "data.dwc:month",
                      "data.dwc:day")

 text_result <- ridigbio::idig_search(rq = list(scientificname = x), fields = fields2getrecord)) ...

For the media pull, I think you want just the accessuri and the UUID:

fields2getmedia <- c("accessuri", "uuid")
media_result <- ridigbio::idig_search_media(
    rq = list(scientificname = x), mq = TRUE, fields = fields2getmedia) |> ...

Your function script will also need modification, since the dates downloaded this way are text strings rather than date objects. There are many ways to convert these to dates, for example, see gatoRs remove_duplicate function or ridigbio proposed solution here.
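As a minimal base R sketch of one such conversion (not the gatoRs or ridigbio approach), the example below builds a date only when year, month, and day are all present, and leaves it as NA otherwise. The data frame is a hypothetical stand-in for a download, and it assumes the columns are named after the requested fields.

```r
# Mock of the text date fields; all values are hypothetical
records <- data.frame(
  uuid = c("a", "b", "c", "d"),
  `data.dwc:year`  = c("1998", "2004", NA, "2010"),
  `data.dwc:month` = c("5", "07", "3", NA),
  `data.dwc:day`   = c("12", NA, "1", NA),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Text to integer; unparseable text becomes NA instead of raising a warning
to_int <- function(x) suppressWarnings(as.integer(x))
year  <- to_int(records[["data.dwc:year"]])
month <- to_int(records[["data.dwc:month"]])
day   <- to_int(records[["data.dwc:day"]])

# Only assemble a date when all three parts were provided
complete <- !is.na(year) & !is.na(month) & !is.na(day)
records$collection_date <- as.Date(NA)
records$collection_date[complete] <- as.Date(
  sprintf("%04d-%02d-%02d", year[complete], month[complete], day[complete]),
  format = "%Y-%m-%d"  # explicit format: impossible dates become NA
)

# Day of year, NA wherever the date is incomplete
records$doy <- as.integer(format(records$collection_date, "%j"))
```

Keeping incomplete dates as NA, rather than filling in a month or day, avoids reintroducing artificial values into the day of year.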

Hope this helps and please let me know if you have any questions or want more specific code suggestions.


mgaynor1 · 2024-01-17T17:51:48Z

I also wanted to let you know that a record query (rq) based on scientificname will only return exact matches. The exact match search may lead to missing records, which may or may not be important for your research. Here is an example of three ways to search iDigBio based on a scientific name to get around this exact match: https://gist.github.com/mgaynor1/8231646b6614c29d384a387ce6731e7b
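To illustrate why exact matching can drop records, here is a purely local sketch. It does not call iDigBio and does not reproduce the approaches in the gist; the name strings are hypothetical examples of how one taxon might be recorded across collections.

```r
# Hypothetical name strings for records of the same or related taxa
sci_names <- c("abronia villosa",
               "abronia villosa var. aurita",
               "abronia villosa s.wats.",
               "abronia umbellata")
target <- "abronia villosa"

# Exact matching keeps only the string that is identical to the target
exact <- sci_names == target

# Prefix matching also keeps varieties and names with authorship appended
prefix <- startsWith(sci_names, target)

n_exact  <- sum(exact)
n_prefix <- sum(prefix)
```

Here the exact comparison keeps one name while the prefix comparison keeps three, which is the kind of gap a broader search strategy is meant to close.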

mgaynor1 · 2024-01-17T18:08:56Z

I saw your comment on the gist and realized the above code was incorrect! This worked for me:

fields2getrecord <- c("data.dwc:scientificName",  
                      "data.dwc:decimalLatitude",   
                      "data.dwc:decimalLongitude",
                      "collector",
                      "uuid", 
                      "data.dwc:eventDate", 
                      "data.dwc:year", 
                      "data.dwc:month", 
                      "data.dwc:day" )

text_result <- ridigbio::idig_search(
  rq = list(scientificname = 'Abronia villosa'), 
  fields = fields2getrecord) 

fields2getmedia <- c("accessuri", "records")
media_result <- ridigbio::idig_search_media(
  rq = list(scientificname = 'Abronia villosa'), mq = TRUE, fields = fields2getmedia) 


media_result$records <- as.character(media_result$records)
joined <- dplyr::inner_join(text_result, media_result, by = c('uuid' = 'records'))
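
The relationship behind this join is that the records field on a media result holds the uuid of the specimen record it belongs to. The sketch below shows the same join on mock data, using base R merge() in place of dplyr::inner_join() so that it runs without extra packages. The uuids and URLs are hypothetical, and it assumes records arrives as a list column, which is why it is converted to character first.

```r
# Mock specimen records (hypothetical uuids)
text_result <- data.frame(
  uuid = c("rec-1", "rec-2", "rec-3"),
  `data.dwc:year` = c("1998", "2004", "2010"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Mock media records; 'records' points back at the specimen uuid
media_result <- data.frame(
  accessuri = c("https://example.org/img1.jpg",
                "https://example.org/img2.jpg",
                "https://example.org/img3.jpg"),
  stringsAsFactors = FALSE
)
media_result$records <- list("rec-1", "rec-1", "rec-3")

# Flatten the list column so it can be matched against uuid
media_result$records <- as.character(media_result$records)

# Inner join: keeps only specimens that have at least one image
joined <- merge(text_result, media_result, by.x = "uuid", by.y = "records")
```

A specimen with two images appears twice in the result, and a specimen with no media (rec-2 here) is dropped.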

sagesteppe · 2024-01-18T15:50:26Z

Hi Shelly,

THANK YOU SO MUCH, for pointing this out! Sorry for the delay in response here, I tried to get back to you before a meeting yesterday, and then...

I have updated that function to 1) only return desired fields (thanks!), and 2) assemble the known collection date from year, month, and day.

Regarding the join to the media results, usually I can find a field to join on pretty quickly. But it took me a decent amount of time to realize that 'result' was the field to use. Could this relationship be added to the documentation? 99% of the reason I have started to use ridigbio is that the URLs to the scanned sheets mean I can easily see what's going on, which obviously has many uses beyond scoring.

Regarding the sheets, some URLs default to downloading the image, and some URLs don't want to open at all (unless you remove this). I cannot find rhyme or reason for the default downloads, but can pull up some examples of the behavior if you are interested? I didn't think it was worth posting about over there...

mgaynor1 · 2024-01-18T16:02:16Z

Thank you for figuring out the correct field to use! I am sadly not in charge of documentation, but I will open an issue on the ridigbio package with this information to be hopefully added. By 'result' do you mean 'record'? Your code still includes the inner_join where 'uuid' = 'record'.

For the URLs that download, I believe iDigBio stopped accepting images in 12/2020, so some URLs may direct to iDigBio-stored images and others may link to the data providers (aka the collections). This is just a guess, so please feel free to send me examples of media records where URLs go directly to download or do not open. I can check in with the iDigBio team and see if this theory is correct for those that are direct downloads. For those URLs that are not working at all, definitely send those along.

sagesteppe · 2024-01-18T19:23:45Z

Yes, I meant 'record'; I haven't seen the sun in a few days and it's starting to take its toll.

mgaynor1 mentioned this issue Jan 17, 2024

Pending update to ridigbio will break spocc! ropensci/spocc#263

Closed

sagesteppe added a commit that referenced this issue Jan 18, 2024

Fix bug GH-1

3c5171a

mgaynor1 mentioned this issue Jan 18, 2024

Documentation on Media records and Records relationship is needed iDigBio/ridigbio#47

Closed

sagesteppe closed this as completed Jan 18, 2024
