-
Notifications
You must be signed in to change notification settings - Fork 208
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added basic muckrock archiving implementation. #109
base: master
Are you sure you want to change the base?
Conversation
src/apps/app_muckrock.erl
Outdated
get_latest() -> | ||
{ok, {{_, 200, _}, _, Body}} = | ||
httpc:request("https://www.muckrock.com/api_v1/foia/?ordering=-id&status=done"), | ||
case re:run(Body, "([^ \"]*?\.pdf)", [global, {capture, all_but_first, list}]) of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't it be easier to download the data as JSON and just pick all the PDF URLs from the JSON?
Edit: Seems like results -> communications -> files -> ffile
are the URLs to all PDFs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, this was done for speed. Let's see if MR like the first version then we know whether to spend time on improving it.
src/apps/app_muckrock.erl
Outdated
tags = | ||
[ | ||
{"app_name", "MuckRock"}, | ||
{"Original-File-Location", Item}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest calling this one Original-URL
instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
src/apps/app_muckrock.erl
Outdated
ssl:start(), | ||
Wallet = ar_wallet:load_keyfile(WalletFile), | ||
Queue = app_queue:start(Wallet), | ||
spawn(fun() -> server(Queue, []) end). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not so sure about how app_muckrock will be used, but wouldn't it makes sense to populate the AlreadySeen
argument to server/2
by using app_search
to find all already uploaded PDFs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that would be very nice.
src/apps/app_muckrock.erl
Outdated
get_latest() -> | ||
{ok, {{_, 200, _}, _, Body}} = | ||
httpc:request("https://www.muckrock.com/api_v1/foia/?ordering=-id&status=done"), | ||
case re:run(Body, "([^ \"]*?\.pdf)", [global, {capture, all_but_first, list}]) of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This regex doesn't look correct and it extracts "relevant pdf" out of this part: "reattached the 3 relevant pdfs. The FY17 Tre"
@samcamwilliams I finished reviewing and then I pushed a commit with white-space changes, so all comments are invalidly flagged as "outdated". I would be happy to implement all my suggested changes if you want me to. |
I accidentally closed the PR so I re-opened it again. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would be happy to implement all my suggested changes if you want me to.
Let's wait until we hear back from them. It might be worth generalising this module a bit so that we can easily re-use it for other integrations. Polling a URL, extracting new links of a certain type, then archiving them seems to be a commonly desired pattern.
If they like it, let's generalise it. If they don't, perhaps we should start a new repo for small, fast prototypes like this.
9b1149b
to
5908db5
Compare
No description provided.