Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow different date formats #146

Open
zbremmer opened this issue Nov 27, 2020 · 13 comments
Open

Allow different date formats #146

zbremmer opened this issue Nov 27, 2020 · 13 comments

Comments

@zbremmer
Copy link

I'd like to modify the Day validator to allow for other date formats: MM/DD/YYYY, MM-DD-YYYY, DD-MM-YYYY, etc. The user could specify a specific format or omit the argument to allow for matching against any of the approved formats (allowing for more flexibility in the data).

I'm happy to build the solution if you think this is useful.

@mechie
Copy link
Contributor

mechie commented Nov 27, 2020

I would favor the current behavior for Day and Timestamp remaining the default unless a major version jump is made--making them match against now-invalid formats could break downstream consumers expecting only the currently-valid format.

An "any" option seems like an iffy/moving target, given just how many date formats there are out there (assuming Yamale doesn't stop at the first N proposed or set out to include nearly every format out there), e.g. Wikipedia: Date format by country.

How about a format='' option that accepts a strptime format str like "%Y/%m/%d" or "%c"? This could allow validating most date/datetime format out there without playing favorites (other than python's ISO8601), though I'm not sure what all would need tweaking on the code side (would the constraints need to be processed using the same format string?).

@zbremmer
Copy link
Author

I think using the strptime format is a good idea. The default for this could be left as is so it doesn't break any current workloads.

I can investigate what this would require on the code side. I need this functionality for a current project and am happy to share if others would find it useful.

@mildebrandt
Copy link

Hi! Thanks for using Yamale!

Currently, pyyaml will deserialize ISO8601 date formats into python datetime.date or datetime.datetime objects. Yamale takes advantage of this behavior for validating the min/max constraints...and for validity. That should remain the default as @mechie suggested.

I also agree with the suggestion of a format option within the Day and Timestamp validators. Be sure to check that the value is a str and not already converted into a date or datetime object. And please include lots of tests. :)

If updating the validator gets to be too much, you can fall back on regex validators if you don't need the min/max constraints.

@zbremmer
Copy link
Author

zbremmer commented Dec 1, 2020

Hi guys,

I dug into this over the weekend and have a workable but somewhat inelegant solution. I wanted to run it past you for a gut check before proceeding.

I can create a new constraint called DateFormat that will test the input by trying to coerce it to a date using datetime.strptime(). In order for this to work, however, we need to change the _is_valid() function of the Day and Timestamp validators. The default validation function first tests if the value from the data file was parsed as a date, which won't be the case unless a user is using the default YYYY-MM-DD format.

My proposal is to update _is_valid() to first check if the class contains the DateFormat constraint. If it does, we try to validate using datetime.strptime(). If that passes (or the data is already a date) then we move through the rest of the validation. Given how the base validator validate() function is structured, this means that we will end up testing the strptime format twice--once in the validator _is_valid() function and again later in the validate() function when moving through all the class constraints.

Thoughts?

@mechie
Copy link
Contributor

mechie commented Dec 1, 2020

The way it's described, that sounds a-ok to me, specially since the relative inefficiency is opt-in through the constraint. I'm not a maintainer, though.

How do the Date/Timestamp min/max constraints account for this change? Are they going to be subclassed and passed DateFormat's value at __init__ and do their own strptime each? Or is the validator's validate doing the strptime and passing the now-datetime value down to all of its constraints? That would at least maintain the 0-2 strptime count, instead of growing to 0-4 (0-N if we count custom constraints).

Separate (but related) doubt, should all constraints on a Date/Timestamp follow DateFormat's tune? i.e., if I pass a DD-MM-YY format in to a Date, should I expect to be able to (or must?) format my min/max constraints that way?

@mildebrandt
Copy link

Another way to handle this would be to override the validate() method in the validator to parse the date. Then call the super method with the parsed date/datetime value.

For the constraints, I would think using the same format for min and max as the format would make sense. Then you'd change the code here: https://github.com/23andMe/Yamale/blob/master/yamale/validators/constraints.py#L36-L41

Thanks for taking this on!

@zbremmer
Copy link
Author

zbremmer commented Dec 9, 2020

Happy to help! I'm still working on this but have run into an issue. Any suggestions would be appreciated.

I took @mildebrandt's advice to override the validate() method in the validator. In this method I look at the constraints in the day validator instance and try to convert using strptime() if the format constraint exists. The problem, however, is that if the data file has any standard YYYY-MM-DD dates in it, the PyYAML base resolver will automatically change them to datetime.date format (same goes for timestamps). Then when the value is passed to validate() it is already a date and we've lost the original format. There is no way to directly test if this date is in the correct format in the yaml data file.

I'm not sure if there's a way we can easily look up the format of the data in the original file without a lot of extra work. Alternatively the default resolver for date/datetime in PyYAML could be overwritten by creating a custom class to extend yaml.SafeLoader.

Is there a better way to do this?

@mildebrandt
Copy link

That is a sticky issue. I haven't verified this, but it seems that the Constraint objects have access to all the constraints of the validator as part of the kwargs variable. So you could see if the format constraint exists when parsing the date....and if it is, use it. That would be in this code:

        try:  # Try to convert value
            # Is this value one of the datetime types?
            if kwtype == datetime.date:
                time = datetime.datetime.strptime(value, '%Y-%m-%d')
                return datetime.date(time.year, time.month, time.day)

            if kwtype == datetime.datetime:
                return datetime.datetime.strptime(value, '%Y-%m-%d %H:%M:%S')

            return kwtype(value)
        except (TypeError, ValueError):
            raise SyntaxError('%s is not a %s' % (key, kwtype))

@zbremmer
Copy link
Author

zbremmer commented Dec 10, 2020

I'm not sure I follow. The issue I'm seeing is on the data side, not on the constraint side.

Let's say we have a data.yaml file as follows:

---
test-file:
name: test_file.csv
create_date: 2020-01-01

When we call yamale.make_data('data.yaml'), the file is parsed by default with PyYAML. When it parses this file, we end up with a data object that looks like this:

[({'test-file': {'name': 'test_file.csv', 'create_date': datetime.date(2020, 1, 1)}}, 'data.yaml')]

PyYAML is parsing anything that matches a specific regex as a date, we lose the original formatting of the date so there is no way (that I can figure out) to compare the original format of the date to the format passed to the Day validator.

I'll have to research it more, but one workaround would be if the PyYAML parser only converts strings if they match the YYYY-MM-DD format. If so, we can check that the format argument is %Y-%m-%d if the value passed to validator is a datetime.date

I still need to investigate if this issue shows up if using rumel as the loader.

@mildebrandt
Copy link

Ah, I misunderstood....though I think you'll still have the issue I mentioned when you start testing constraints.

For this, we could tell the parser to not convert dates and timestamps. However, we have to be VERY careful that we do EXACTLY what PyYAML and ruamel do when parsing dates and times when a format constraint doesn't exist for the validator. Did I put enough bold and capital letters? :)

Here are a couple examples:
https://stackoverflow.com/questions/34667108/ignore-dates-and-times-while-parsing-yaml
https://stackoverflow.com/questions/50900727/skip-converting-entities-while-loading-a-yaml-string-using-pyyaml

@zbremmer
Copy link
Author

I created a pull request for this feature. I was able to mimic the default PyYAML and ruamel behavior in a function called util.parse_default_date() which solves the issue mentioned above (the regex I'm using in that function is the same one used in PyYAML and ruamel).

I wrote and ran several tests which all passed. Please take a look and let me know if there is anything that I missed!

@mildebrandt
Copy link

Sorry it's taking a long time to get to this, we're pretty busy over here. Hopefully I can schedule some time soon.

@zbremmer
Copy link
Author

No worries, I totally get it!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants