-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
code review 2021-02-09 #1
Comments
Part 2 - all minor
|
Part 3/3
These actually are intentional! I know they seem weird, but they're based on individual searches of the conferences to see what else shows up. Typically, either the \b is there because otherwise we saw something that was incorrect, or the \b is not there because there was, say, a workshop that was considered part of the conference that extended the acronym that I wanted to make sure was still included. I know, it's very weird, but I promise there's method to the madness :)
No and because (if I recall correctly) there's another couple conferences that start with the exact same thing and then go on to "and but the other topics are all different and I didn't want to disinclude them separately. No exactly conference title because there's still that comes before and after the conference title that is valid like (again, pulling from memory) "the 14th" or the "the 15th" or blah blah workshop, and I didn't want to include each of those things separately. Nice clean project, everything well-commented 👍 ! |
Part 1 (to be continued!)
Despite my many comments, I generally found aggregate_organizations.py clean and easy to read! 💯
Possibly more important things to look at
I can see how this would be confusing, but no! I meant if we aren't using the list of orgs not to roll up, which is called no_roll_up. Maybe this will be clearer if I rename it the no_roll_up list so it matches the variable? And explain more.
Sigh, yes, you're totally correct. I don't think this ever happens in practice but we should be prepared for if it does.
I basically arrived at the current code iteration because I tried this a billion other ways that were more nested or recursive and they were all incredibly slow and inefficient. And doing it this way gradually broke my brain because there are so many potential corner cases like this and because I'm not handling it in a nice neat nested way I have to handle all of them. But the nice neat nested ways were all So Slow. If you have better ideas I'd be very very grateful. Or like if you just have a better idea for how to fix this particular bad way.
Huh, I feel like something must have gone wrong here; I feel like this used to be inside the else? That solves this, right? I wonder if I edited that in some time after my last run...
Minor/nits
No, it does not! So no reason to. Only the parent org's location is used, so we only ever add one location :).
At this point, yes, it is quite fast! As I noted above, when I tried to use some other techniques to map child to parent I had some speed issues, but definitely the organization attribute parts have never caused any. So I'm inclined not to worry about this.
Hmm, I guess it could be a very minor concern? It's not going to be an issue for an individual organization because parents are deduplicated, but I guess there's a chance that, say, if multiple organizations were aggregated together and they all had the same parent but some of them were acquired by that parent and others weren't that parent could appear repeatedly. I've updated; let me know if the change looks reasonable.
No, because we're pulling these from our original dataset, and no child in the original dataset will have the same parent id twice! (I actually ensure that doubly so in the deduplicate_companies.py code, so I'm confident of it).
That is reasonable. I pulled it out for a few reasons, and they may be totally irrational: 1. I'm not that into having a hardcoded list like this in the first place, and I'm still contemplating whether it would be better to have it be something pulled from SQL etc. It seems silly to do that right now when it's just two organizations though. But I sort of wanted it there to remind me and make it easier to separate out if it grows. and 2. My inclination is for class variables to be things that get acted on/used or modified throughout the class, whereas globals are hardcoded values used by the program. I realize some of this inclination comes from the C/C++/Java universe where globals are much more of a thing than in Python, but it still makes me feel squeamish to have just. Hardcoded lists like this living in my classes. I'm happy to change this if you have strong feelings though, just explaining my (probably irrational) reasoning.
if not args.output_file.endswith(".jsonl")
?Sure :)
add_value(attribute, value_to_add)
method (called likeadd_value("location", your_location_dict)
. It might make your life easier when adding new fields down the road, and it would mean you'd be able to replace most ofupdate_organization_data
orupdate_organization_identifiers
with iteration over a set of attributes to update. However, if ultimately you wanted to do a lot of field-specific preprocessing in the add_* methods, this might not be a good idea. Maybe something to think about for later.I think I want the ability to add field-specific preprocessing, even if I'm not doing it right now. So at least for now I'm going to skip this, although I'll keep it in mind!
I added a top-level description :) sorry about that. I'm not sure why I didn't just have them for everything. I should probably go back and do that.
Its parents can have parents :(. But its parents never have more than one parent! At least in our current dataset. Again, this is currently Very Hacky Code. If it helps, this is one of the few things the unit tests are currently validating (that is, they're asserting that once this code is run, everything only has one parent!). So if something goes wrong as more organizations are added and suddenly some parent has multiple parents (ugh I really hope we never have that complicated an organization structure but maybe), the unit tests should catch it.
Again, though, if you have a non-hacky way to to this that maintains speed, I am all for it! I hate the hacky way I did it. A lot.
The text was updated successfully, but these errors were encountered: