Date returned #44

Open
mgaynor1 opened this issue Jan 9, 2024 · 4 comments

Comments

@mgaynor1
Collaborator

mgaynor1 commented Jan 9, 2024

This function currently returns "datecollected", which is a modified field and could lack biological meaning. Date instead should be returned as the following fields: "data.dwc:eventDate", "data.dwc:year", "data.dwc:month", and "data.dwc:day".

"datecollected",
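As a rough illustration of the proposed change, the sketch below swaps "datecollected" for the four Darwin Core date fields in a vector of field names. The names `DWC_DATE_FIELDS` and `use_dwc_dates()` are hypothetical helpers made up for this example; they are not part of ridigbio.

```r
# Hypothetical constant: the four Darwin Core date fields named in this issue.
DWC_DATE_FIELDS <- c("data.dwc:eventDate",
                     "data.dwc:year",
                     "data.dwc:month",
                     "data.dwc:day")

# Hypothetical helper: replace "datecollected" in a field vector with the
# raw Darwin Core date fields, keeping the position of the original entry.
use_dwc_dates <- function(fields) {
  pos <- match("datecollected", fields)
  if (is.na(pos)) {
    # "datecollected" was not requested; just make sure the date fields are present.
    return(union(fields, DWC_DATE_FIELDS))
  }
  unique(append(fields[-pos], DWC_DATE_FIELDS, after = pos - 1))
}

use_dwc_dates(c("uuid", "datecollected", "collector"))
```

The resulting vector could then be passed as the `fields` argument of `idig_search_records()`.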

When this is modified, someone should reach out to spocc. They will need to update multiple scripts including:

https://github.com/ropensci/spocc/blob/59f6b3b192cd8a7bb990aab94748f3bc7b044dac/R/plugin_helpers.R#L34
https://github.com/ropensci/spocc/blob/59f6b3b192cd8a7bb990aab94748f3bc7b044dac/R/occ2df.R#L104
https://github.com/ropensci/spocc/blob/59f6b3b192cd8a7bb990aab94748f3bc7b044dac/R/plugins.r#L266

@jbennettufl
Contributor

After our last internal meeting, we decided to update how datecollected is built: when the month or day is missing, it will be set to the first month or the first day. For example, if only dwc:year has a value, such as "1984", then datecollected becomes 1984-01-01. Since this is an ongoing effort related to iDigBio/idb-backend#229, I don't think it is necessary to fix it in both places; a change in the R client becomes unnecessary once the data itself correctly represents the Darwin Core fields. There are also backwards-compatibility and implementation details to address. If we were to attempt to update this field in the client, it would take considerable overhead and still have some side effects, but the same thing can be achieved with the following code:

library(flipTime)
library("ridigbio")

DATECORRECTED_FIELDS <- c("uuid",
                          "occurrenceid",
                          "catalognumber",
                          "family",
                          "genus",
                          "scientificname",
                          "country",
                          "stateprovince",
                          "geopoint",
                          "data.dwc:eventDate",
                          "data.dwc:year",
                          "data.dwc:month",
                          "data.dwc:day",
                          "collector",
                          "recordset")

# `rq` is the record query list and must be defined before this call
df <- idig_search_records(rq = rq, fields = DATECORRECTED_FIELDS, limit = 6000)

df <- within(df, datecollected <- as.Date("1970-01-01"))

for (i in seq_along(df$`data.dwc:eventDate`)) {
  if (!is.na(df$`data.dwc:eventDate`[i])) {
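    # Contains a slash (a date range), so take the date to the left.
    # grepl() tests for a substring; `%in%` would only match a value equal to "/".
    if (grepl("/", df$`data.dwc:eventDate`[i], fixed = TRUE)) {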
      date_range <- unlist(strsplit(df$`data.dwc:eventDate`[i], "/"))
      start_date <- AsDate(date_range[1], on.parse.failure = "warn")

      # Use the date to the left of the forward slash
      df$datecollected[i] <- start_date
    } else {
      # If "data.dwc:eventDate" is present but without a slash, use AsDate()
      df$datecollected[i] <- AsDate(df$`data.dwc:eventDate`[i],
                                    on.parse.failure = "warn")
    }
  } else {
    # If "data.dwc:eventDate" is not present, construct the date
    year <- df$`data.dwc:year`[i]
    month <- df$`data.dwc:month`[i]
    day <- df$`data.dwc:day`[i]

    # Construct the date based on available components
    if (!is.na(year) && !is.na(month) && !is.na(day)) {
      df$datecollected[i] <- AsDate(paste(year, month, day, sep = "-"),
                                    on.parse.failure = "warn")
    } else if (!is.na(year) && !is.na(month)) {
      df$datecollected[i] <- AsDate(paste(year, month, "01", sep = "-"),
                                    on.parse.failure = "warn")
    } else if (!is.na(year)) {
      df$datecollected[i] <- AsDate(paste(year, "01", "01", sep = "-"),
                                    on.parse.failure = "warn")
    } else {
      # Handle the case where there is no information to construct a date
      df$datecollected[i] <- NA
    }
  }
}

flipTime can be installed with the following commands:

require(devtools)
install_github("Displayr/flipTime")

This code works as a workaround, but the logic required to generate the "proper" datecollected carries considerable overhead: it is an n+1 problem, whereas fixing the data would require no overhead at all. A less precise way to do this using native functions would be the following:

df <- within(df, datecollected <- as.Date(df$`data.dwc:eventDate`))

As you can see, there is no logic here for determining dates from a separate year, month, or day. It does demonstrate, however, that as.Date can process all values in the data frame at once, while something like flipTime's AsDate, which is more precise, is unable to take all the values as a single parameter and transform them at once. If anyone has good suggestions for accomplishing this with little to no overhead, we are open to modifying the R library to use them. From these initial attempts at the suggested change, though, getting the correct data straight from the source seems like the preferable solution at the moment.
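One possible way to get the year/month/day fallback without a per-row loop is sketched below using base R only. `derive_datecollected()` is a hypothetical name, not a ridigbio function. The sketch assumes eventDate values are ISO-style strings starting with `YYYY-MM-DD`. An eventDate that does not start with a full date, such as "1984", is treated as missing and the separate fields are used instead, so it is not a drop-in equivalent of the flipTime loop.

```r
# Hypothetical helper: vectorised start date from Darwin Core date fields.
derive_datecollected <- function(event_date, year, month, day) {
  to_int <- function(x) suppressWarnings(as.integer(as.character(x)))

  # For ranges such as "1984-05-06/1984-05-09", keep the part left of the slash,
  # then keep only the leading YYYY-MM-DD characters (drops any time component).
  start <- sub("/.*$", "", as.character(event_date))
  from_event <- as.Date(substr(start, 1, 10), format = "%Y-%m-%d")

  y <- to_int(year)
  m <- to_int(month)
  d <- to_int(day)

  # Missing day, or a day without a month, falls back to the first of the month;
  # a missing month falls back to January.
  d[is.na(m) | is.na(d)] <- 1L
  m[is.na(m)] <- 1L
  from_parts <- as.Date(sprintf("%04d-%02d-%02d", y, m, d), format = "%Y-%m-%d")

  # Prefer the parsed eventDate; otherwise use the constructed date (NA if no year).
  out <- from_event
  use_parts <- is.na(out)
  out[use_parts] <- from_parts[use_parts]
  out
}
```

With a data frame returned by `idig_search_records()`, this could be called as `derive_datecollected(df[["data.dwc:eventDate"]], df[["data.dwc:year"]], df[["data.dwc:month"]], df[["data.dwc:day"]])`. Because `as.Date()` is given an explicit format, unparseable values become `NA` rather than raising an error.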

@mgaynor1
Collaborator Author

Even with modifications to the ingestion process, the columns we return by default need to be modified.

I suggest that, by default, we return date columns that are in the Darwin Core format. We should not modify these fields at all. When selecting fields to return to users by default, we should return interpretable fields that any user could use; this modification is meant to lower the learning curve for a data user. Date modifications by GBIF and iDigBio are important for indexing but are not helpful for phenological studies. Additionally, datecollected is an undocumented internal field, and we should not return a field that users cannot interpret. We should not modify any field values in any functions available within this package. I am not suggesting any modification.

Once a researcher or data-user downloads their data with this function, they then can modify it however they wish. We actually use the 4 date columns above in a function here: https://github.com/nataliepatten/gatoRs/blob/main/R/remove_duplicates.R

I suggested all 4 columns because, in my experience, some collections only fill out the eventDate, while others only fill out the day, month, and year.
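To check how a given download is split between these cases, a small sketch like the following could be used. `classify_date_fields()` is a hypothetical helper written for this comment, and it only assumes the four column names listed above.

```r
# Hypothetical helper: label each record by which Darwin Core date fields are filled.
classify_date_fields <- function(df) {
  present <- function(x) !is.na(x) & nzchar(as.character(x))

  has_event <- present(df[["data.dwc:eventDate"]])
  has_ymd <- present(df[["data.dwc:year"]]) &
    present(df[["data.dwc:month"]]) &
    present(df[["data.dwc:day"]])

  ifelse(has_event & has_ymd, "both",
         ifelse(has_event, "eventDate only",
                ifelse(has_ymd, "year/month/day only", "neither")))
}

# Toy records standing in for a real download.
toy <- data.frame(
  `data.dwc:eventDate` = c("1984-05-06", NA, NA),
  `data.dwc:year`      = c(NA, "1984", NA),
  `data.dwc:month`     = c(NA, "5", NA),
  `data.dwc:day`       = c(NA, "6", NA),
  check.names = FALSE,
  stringsAsFactors = FALSE
)
table(classify_date_fields(toy))
```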

@jbennettufl
Contributor

Ok, just to be super clear here you want me to remove this line:

"datecollected",
and replace it with "data.dwc:eventDate", "data.dwc:year", "data.dwc:month", and "data.dwc:day"? Is this correct?

@mgaynor1
Collaborator Author

Yes. Please add documentation as well.


2 participants