Point sync_csv script to legacy subdomain #37

Closed
rviscomi opened this issue Apr 23, 2018 · 1 comment

@rviscomi
Member

sync_csv.sh points to http://httparchive.org for the downloads folder containing CSV data. Since the launch of the new website, the new host for this data is https://legacy.httparchive.org. The URLs need to be updated in this script to point to the appropriate server.

The stale URLs caused a pipeline failure during the 2018_04_01 crawl; the pipeline had to be manually fixed and restarted to complete.
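A minimal sketch of one way to keep the host in a single place so a future move only needs a one-line change. The real contents of sync_csv.sh are not shown in this issue, so the variable `HA_DOWNLOADS_BASE` and the function `csv_url` are hypothetical names, and the sketch assumes the `/downloads` path is unchanged on the legacy host. The file name pattern follows the URLs in the import log.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: names below are illustrative, not taken from the real sync_csv.sh.

# Base URL of the host serving the CSV dumps (assumed path: /downloads).
# Can be overridden from the environment if the data moves again.
HA_DOWNLOADS_BASE="${HA_DOWNLOADS_BASE:-https://legacy.httparchive.org/downloads}"

# Build the download URL for one table of one archive,
# e.g. archive "mobile_Apr_1_2018" and table "pages".
csv_url() {
  local archive="$1"
  local table="$2"
  printf '%s/httparchive_%s_%s.csv.gz\n' "$HA_DOWNLOADS_BASE" "$archive" "$table"
}

# Example: print the URL for the mobile pages dump of the Apr 1 2018 crawl.
csv_url "mobile_Apr_1_2018" "pages"
```

Every download in the script would then go through the same base URL instead of repeating the hostname per file.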

@rviscomi
Member Author

FYI this issue was identified by the logs:

root@worker:~/code# tail /var/log/HAimport.log
Processing Apr_1_2018, mobile: 1, archive: mobile_Apr_1_2018
Downloading data for mobile_Apr_1_2018
https://httparchive.org/downloads/httparchive_mobile_Apr_1_2018_pages.csv.gz:
2018-04-13 08:00:02 ERROR 404: NOT FOUND.
Pages data for Apr_1_2018 is missing, exiting
Processing Apr_1_2018, mobile: 0, archive: Apr_1_2018
Downloading data for Apr_1_2018
https://httparchive.org/downloads/httparchive_Apr_1_2018_pages.csv.gz:
2018-04-13 15:00:02 ERROR 404: NOT FOUND.
Pages data for Apr_1_2018 is missing, exiting
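For spotting this kind of failure without reading the log by hand, the following is a rough sketch that lists the URLs reported as 404. It assumes the log layout shown above (a URL line ending in a colon, followed by a line containing `ERROR 404`); the function name `failed_urls` is hypothetical and not part of the existing pipeline.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every URL in an import log that was followed by a 404.
# Assumes a URL line ending in ":" precedes the "ERROR 404" line, as in the log above.
failed_urls() {
  awk '
    # Remember the most recent URL line, stripping the trailing colon.
    /^https?:\/\/.*:$/ { url = $0; sub(/:$/, "", url); next }
    # On a 404 error line, report the remembered URL once.
    /ERROR 404/ && url != "" { print url; url = "" }
  ' "$1"
}

# Usage (path taken from the log command above):
#   failed_urls /var/log/HAimport.log
```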
