Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

code review 2021-02-09 #1

Closed
10 of 12 tasks
jmelot opened this issue Feb 9, 2021 · 2 comments
Closed
10 of 12 tasks

code review 2021-02-09 #1

jmelot opened this issue Feb 9, 2021 · 2 comments
Assignees

Comments

@jmelot
Copy link
Contributor

jmelot commented Feb 9, 2021

Part 1 (to be continued!)

Despite my many comments, I generally found aggregate_organizations.py clean and easy to read! 💯

Possibly more important things to look at

I can see how this would be confusing, but no! I meant if we aren't using the list of orgs not to roll up, which is called no_roll_up. Maybe this will be clearer if I rename it the no_roll_up list so it matches the variable? And explain more.

Sigh, yes, you're totally correct. I don't think this ever happens in practice but we should be prepared for if it does.

I basically arrived at the current code iteration because I tried this a billion other ways that were more nested or recursive and they were all incredibly slow and inefficient. And doing it this way gradually broke my brain because there are so many potential corner cases like this and because I'm not handling it in a nice neat nested way I have to handle all of them. But the nice neat nested ways were all So Slow. If you have better ideas I'd be very very grateful. Or like if you just have a better idea for how to fix this particular bad way.

Huh, I feel like something must have gone wrong here; I feel like this used to be inside the else? That solves this, right? I wonder if I edited that in some time after my last run...

Minor/nits

No, it does not! So no reason to. Only the parent org's location is used, so we only ever add one location :).

  • Are you happy with the performance of aggregate_organizations generally? It crossed my mind that storing Organization's list attributes as sets or dicts might speed things up a little, but these lists are probably quite short in general in which case this probably wouldn't make much difference.

At this point, yes, it is quite fast! As I noted above, when I tried to use some other techniques to map child to parent I had some speed issues, but definitely the organization attribute parts have never caused any. So I'm inclined not to worry about this.

Hmm, I guess it could be a very minor concern? It's not going to be an issue for an individual organization because parents are deduplicated, but I guess there's a chance that, say, if multiple organizations were aggregated together and they all had the same parent but some of them were acquired by that parent and others weren't that parent could appear repeatedly. I've updated; let me know if the change looks reasonable.

No, because we're pulling these from our original dataset, and no child in the original dataset will have the same parent id twice! (I actually ensure that doubly so in the deduplicate_companies.py code, so I'm confident of it).

That is reasonable. I pulled it out for a few reasons, and they may be totally irrational: 1. I'm not that into having a hardcoded list like this in the first place, and I'm still contemplating whether it would be better to have it be something pulled from SQL etc. It seems silly to do that right now when it's just two organizations though. But I sort of wanted it there to remind me and make it easier to separate out if it grows. and 2. My inclination is for class variables to be things that get acted on/used or modified throughout the class, whereas globals are hardcoded values used by the program. I realize some of this inclination comes from the C/C++/Java universe where globals are much more of a thing than in Python, but it still makes me feel squeamish to have just. Hardcoded lists like this living in my classes. I'm happy to change this if you have strong feelings though, just explaining my (probably irrational) reasoning.

Sure :)

  • Reading this https://github.com/georgetown-cset/ai-companies-visualization/blob/master/aggregate_organizations.py#L203 it struck me that you could probably abstract most/all of your Organization.add_* methods into one add_value(attribute, value_to_add) method (called like add_value("location", your_location_dict). It might make your life easier when adding new fields down the road, and it would mean you'd be able to replace most of update_organization_data or update_organization_identifiers with iteration over a set of attributes to update. However, if ultimately you wanted to do a lot of field-specific preprocessing in the add_* methods, this might not be a good idea. Maybe something to think about for later.

I think I want the ability to add field-specific preprocessing, even if I'm not doing it right now. So at least for now I'm going to skip this, although I'll keep it in mind!

I added a top-level description :) sorry about that. I'm not sure why I didn't just have them for everything. I should probably go back and do that.

Its parents can have parents :(. But its parents never have more than one parent! At least in our current dataset. Again, this is currently Very Hacky Code. If it helps, this is one of the few things the unit tests are currently validating (that is, they're asserting that once this code is run, everything only has one parent!). So if something goes wrong as more organizations are added and suddenly some parent has multiple parents (ugh I really hope we never have that complicated an organization structure but maybe), the unit tests should catch it.

Again, though, if you have a non-hacky way to to this that maintains speed, I am all for it! I hate the hacky way I did it. A lot.

@jmelot
Copy link
Contributor Author

jmelot commented Feb 9, 2021

Part 2 - all minor

@jmelot
Copy link
Contributor Author

jmelot commented Feb 10, 2021

Part 3/3

These actually are intentional! I know they seem weird, but they're based on individual searches of the conferences to see what else shows up. Typically, either the \b is there because otherwise we saw something that was incorrect, or the \b is not there because there was, say, a workshop that was considered part of the conference that extended the acronym that I wanted to make sure was still included. I know, it's very weird, but I promise there's method to the madness :)

No and because (if I recall correctly) there's another couple conferences that start with the exact same thing and then go on to "and but the other topics are all different and I didn't want to disinclude them separately. No exactly conference title because there's still that comes before and after the conference title that is valid like (again, pulling from memory) "the 14th" or the "the 15th" or blah blah workshop, and I didn't want to include each of those things separately.

Nice clean project, everything well-commented 👍 !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants