Add archive.is as third archiving option #35

Open
adam3smith opened this issue Sep 30, 2021 · 1 comment
@adam3smith
Contributor

No description provided.

@mccallc
Collaborator

mccallc commented Feb 7, 2022

OK, I've come to the conclusion that implementing this source is not feasible. Is there something obvious I'm missing? Please let me know if there is.

There is a now-abandoned Python implementation for submitting to archive.is (last updated 2020), but using it now always returns an HTTP 429 (Too Many Requests) error. I ran into the same problem when I tried to emulate the main form submission with rvest. If you then browse to the site manually, you get hit with a CAPTCHA. I think they've walled the service off pretty well from basic scrapers.
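For anyone who wants to check whether the block is still in place, here is a minimal Python sketch of how a client could label the response it gets back. It runs offline and makes no request itself. The helper names `classify_submission` and `retry_after_seconds` are hypothetical, and whether archive.is sends a `Retry-After` header with its 429 responses is an assumption I have not verified.

```python
from typing import Mapping, Optional


def classify_submission(status: int) -> str:
    """Label the HTTP status of an archiving attempt (hypothetical helper)."""
    if status == 429:
        # 429 Too Many Requests: the server is throttling or blocking the client.
        return "rate_limited"
    if 200 <= status < 400:
        return "ok"
    return "failed"


def retry_after_seconds(headers: Mapping[str, str]) -> Optional[int]:
    """Return Retry-After as whole seconds, or None if absent or not numeric.

    Assumption: the server may or may not send this header at all.
    The HTTP-date form of Retry-After is deliberately not parsed here.
    """
    for name, value in headers.items():
        if name.lower() == "retry-after":
            value = value.strip()
            if value.isascii() and value.isdigit():
                return int(value)
            return None
    return None
```

If every attempt comes back as `"rate_limited"` regardless of how long you wait, that matches the behavior described above and suggests the block is deliberate rather than a temporary throttle.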

The Memento robust links API discourages use for explicit archiving. The tool they recommend for this purpose, archivenow, has an archive.is handler that submits collections of URLs by commandeering a running instance of Firefox (?!) through a library called Selenium. The program itself isn't that complex, but such unusual dependencies make it pretty hostile to implementation in R.
