code review 2021-02-09 #1

jmelot · 2021-02-09T23:02:48Z

Part 1 (to be continued!)

Despite my many comments, I generally found aggregate_organizations.py clean and easy to read! 💯

Possibly more important things to look at

Did you mean "if we aren't using the roll-up list"? https://github.com/georgetown-cset/ai-companies-visualization/blob/master/aggregate_organizations.py#L123

I can see how this would be confusing, but no! I meant if we aren't using the list of orgs not to roll up, which is called no_roll_up. Maybe this will be clearer if I rename it the no_roll_up list so it matches the variable? And explain more.

https://github.com/georgetown-cset/ai-companies-visualization/blob/master/aggregate_organizations.py#L138 Low probability I guess, but what if an org has two parents, one of which has a parent, appearing in this order: [new_parent, new_parent_with_a_parent]. The new_parent does change, so this code block is not entered and new_parent_with_a_parent as the new_parent.

Sigh, yes, you're totally correct. I don't think this ever happens in practice but we should be prepared for if it does.

I basically arrived at the current code iteration because I tried this a billion other ways that were more nested or recursive and they were all incredibly slow and inefficient. And doing it this way gradually broke my brain because there are so many potential corner cases like this and because I'm not handling it in a nice neat nested way I have to handle all of them. But the nice neat nested ways were all So Slow. If you have better ideas I'd be very very grateful. Or like if you just have a better idea for how to fix this particular bad way.

https://github.com/georgetown-cset/ai-companies-visualization/blob/master/aggregate_organizations.py#L145 the value of parent in this if block will always be the last element of child_mapping_dict[child] if len(child_mapping_dict[child]) > 1, is that desired?

Huh, I feel like something must have gone wrong here; I feel like this used to be inside the else? That solves this, right? I wonder if I edited that in some time after my last run...

Minor/nits

https://github.com/georgetown-cset/ai-companies-visualization/blob/master/aggregate_organizations.py#L29 does add_location get called multiple times per org? If so, any reason to modify the behavior here to choose the "best" (least null? most frequently occurring?) location?

No, it does not! So no reason to. Only the parent org's location is used, so we only ever add one location :).

Are you happy with the performance of aggregate_organizations generally? It crossed my mind that storing Organization's list attributes as sets or dicts might speed things up a little, but these lists are probably quite short in general in which case this probably wouldn't make much difference.

At this point, yes, it is quite fast! As I noted above, when I tried to use some other techniques to map child to parent I had some speed issues, but definitely the organization attribute parts have never caused any. So I'm inclined not to worry about this.

Do we need to worry about the same parent id getting added multiple times with different attributes here? https://github.com/georgetown-cset/ai-companies-visualization/blob/master/aggregate_organizations.py#L88

Hmm, I guess it could be a very minor concern? It's not going to be an issue for an individual organization because parents are deduplicated, but I guess there's a chance that, say, if multiple organizations were aggregated together and they all had the same parent but some of them were acquired by that parent and others weren't that parent could appear repeatedly. I've updated; let me know if the change looks reasonable.

Do we need to deduplicate parent id here? https://github.com/georgetown-cset/ai-companies-visualization/blob/master/aggregate_organizations.py#L113

No, because we're pulling these from our original dataset, and no child in the original dataset will have the same parent id twice! (I actually ensure that doubly so in the deduplicate_companies.py code, so I'm confident of it).

I'd suggest making this a class variable of OrganizationAggregator: https://github.com/georgetown-cset/ai-companies-visualization/blob/master/aggregate_organizations.py#L7

That is reasonable. I pulled it out for a few reasons, and they may be totally irrational: 1. I'm not that into having a hardcoded list like this in the first place, and I'm still contemplating whether it would be better to have it be something pulled from SQL etc. It seems silly to do that right now when it's just two organizations though. But I sort of wanted it there to remind me and make it easier to separate out if it grows. and 2. My inclination is for class variables to be things that get acted on/used or modified throughout the class, whereas globals are hardcoded values used by the program. I realize some of this inclination comes from the C/C++/Java universe where globals are much more of a thing than in Python, but it still makes me feel squeamish to have just. Hardcoded lists like this living in my classes. I'm happy to change this if you have strong feelings though, just explaining my (probably irrational) reasoning.

https://github.com/georgetown-cset/ai-companies-visualization/blob/master/aggregate_organizations.py#L233 Most ridiculously minor comment: maybeif not args.output_file.endswith(".jsonl") ?

Sure :)

Reading this https://github.com/georgetown-cset/ai-companies-visualization/blob/master/aggregate_organizations.py#L203 it struck me that you could probably abstract most/all of your Organization.add_* methods into one add_value(attribute, value_to_add) method (called like add_value("location", your_location_dict). It might make your life easier when adding new fields down the road, and it would mean you'd be able to replace most of update_organization_data or update_organization_identifiers with iteration over a set of attributes to update. However, if ultimately you wanted to do a lot of field-specific preprocessing in the add_* methods, this might not be a good idea. Maybe something to think about for later.

I think I want the ability to add field-specific preprocessing, even if I'm not doing it right now. So at least for now I'm going to skip this, although I'll keep it in mind!

The interleaving of code and comments here is almost poetic :D https://github.com/georgetown-cset/ai-companies-visualization/blob/master/aggregate_organizations.py#L117 As a reader, I'd have benefited from a top-level description of what you are doing and your approach, I did get confused reading this, but maybe this is just me

I added a top-level description :) sorry about that. I'm not sure why I didn't just have them for everything. I should probably go back and do that.

Its parents can't have parents... right? https://github.com/georgetown-cset/ai-companies-visualization/blob/master/aggregate_organizations.py#L130 And what if the parent has more than one parent?

Its parents can have parents :(. But its parents never have more than one parent! At least in our current dataset. Again, this is currently Very Hacky Code. If it helps, this is one of the few things the unit tests are currently validating (that is, they're asserting that once this code is run, everything only has one parent!). So if something goes wrong as more organizations are added and suddenly some parent has multiple parents (ugh I really hope we never have that complicated an organization structure but maybe), the unit tests should catch it.

Again, though, if you have a non-hacky way to to this that maintains speed, I am all for it! I hate the hacky way I did it. A lot.

The text was updated successfully, but these errors were encountered:

jmelot · 2021-02-09T23:29:49Z

Part 2 - all minor

In the readme: 3.) Update... maybe "Replace high_resolution_entities.organizations with the data from..."
You have really nice comments at the top of your other queries, maybe worth adding one to this query too (although perhaps the query name makes it sort of self-explanatory!) https://github.com/georgetown-cset/ai-companies-visualization/blob/master/sql/selecting_no_grid_ai_publications.sql
I am only complaining about whitespace because the rest of this is so beautifully formatted - did something happen here? https://github.com/georgetown-cset/ai-companies-visualization/blob/master/sql/selecting_grid_ai_patents_current_assignee.sql#L25

jmelot · 2021-02-10T14:50:39Z

Part 3/3

There's some inconsistency with whether/when you use \b around the acronyms, just mentioning it in case not intentional: https://github.com/georgetown-cset/ai-companies-visualization/blob/master/sql/pulling_publication_counts_in_top_ai_publications_grid.sql

These actually are intentional! I know they seem weird, but they're based on individual searches of the conferences to see what else shows up. Typically, either the \b is there because otherwise we saw something that was incorrect, or the \b is not there because there was, say, a workshop that was considered part of the conference that extended the acronym that I wanted to make sure was still included. I know, it's very weird, but I promise there's method to the madness :)

No and? Maybe do \band\b or limit the search to the exact conference name? https://github.com/georgetown-cset/ai-companies-visualization/blob/master/sql/pulling_publication_counts_in_top_ai_publications_grid.sql#L24

No and because (if I recall correctly) there's another couple conferences that start with the exact same thing and then go on to "and but the other topics are all different and I didn't want to disinclude them separately. No exactly conference title because there's still that comes before and after the conference title that is valid like (again, pulling from memory) "the 14th" or the "the 15th" or blah blah workshop, and I didn't want to include each of those things separately.

Nice clean project, everything well-commented 👍 !

jmelot assigned rggelles Feb 9, 2021

rggelles closed this as completed Apr 20, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

code review 2021-02-09 #1

code review 2021-02-09 #1

jmelot commented Feb 9, 2021 •

edited by rggelles

Loading

jmelot commented Feb 9, 2021 •

edited by rggelles

Loading

jmelot commented Feb 10, 2021 •

edited by rggelles

Loading

code review 2021-02-09 #1

code review 2021-02-09 #1

Comments

jmelot commented Feb 9, 2021 • edited by rggelles Loading

jmelot commented Feb 9, 2021 • edited by rggelles Loading

jmelot commented Feb 10, 2021 • edited by rggelles Loading

jmelot commented Feb 9, 2021 •

edited by rggelles

Loading

jmelot commented Feb 9, 2021 •

edited by rggelles

Loading

jmelot commented Feb 10, 2021 •

edited by rggelles

Loading