httparchive.latest.summary_requests_desktop/mobile not updated #76

Closed
foolip opened this issue Jul 22, 2019 · 4 comments

foolip commented Jul 22, 2019

summary_pages_desktop was updated on July 1, but the summary_requests_mobile table hasn't been updated since May 1, as the two attached screenshots show.

As a result of this, it's not possible to use the two tables together by joining on pageid. I'm instead having to use JSON_EXTRACT(payload, '$._contentType') AS contentType on the full requests table.
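For anyone following along, the per-row effect of that workaround can be sketched in Python. This is only an illustration under assumptions: the `content_type` helper and the sample payload are made up here, and real request payloads carry many more fields.

```python
import json
from typing import Optional


def content_type(payload: str) -> Optional[str]:
    """Return the `_contentType` string from a JSON payload, or None.

    Hypothetical helper: mirrors reading '$._contentType' from one row's
    payload column, returning None when the payload is unusable.
    """
    try:
        record = json.loads(payload)
    except json.JSONDecodeError:
        return None  # payload is not valid JSON
    if not isinstance(record, dict):
        return None  # payload is JSON but not an object
    value = record.get("_contentType")
    return value if isinstance(value, str) else None


# Made-up sample payload, far smaller than a real one.
sample = '{"_contentType": "text/html", "_gzip_save": "0"}'
print(content_type(sample))
```

In BigQuery itself, JSON_EXTRACT returns the value still JSON-quoted, while JSON_EXTRACT_SCALAR returns the bare string, which is usually what a contentType comparison wants.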

Context: I'm updating the "HTTP Archive for web compat decision making" doc.

rviscomi added the bug label Jul 22, 2019
Member

rviscomi commented Jul 22, 2019

The most recent runs of the scheduled queries for desktop/mobile summary requests have failed with this error:

Job 226352634162:scheduled_query_5d1b264f-0000-22bc-a112-f4f5e80d17d0 (table summary_requests_mobile) failed with error INVALID_ARGUMENT: Cannot read field '_gzip_save' of type STRING as INT64; JobID: 226352634162:scheduled_query_5d1b264f-0000-22bc-a112-f4f5e80d17d0

_gzip_save is type STRING in 2019_06_01_desktop, so the wildcard query is failing. I'll convert that field to INTEGER and rerun the scheduled queries.
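A rough way to spot this kind of mismatch before a wildcard query fails is to compare field types across the tables' schemas. The sketch below assumes the schemas have already been fetched into plain dicts; the `find_type_mismatches` helper and the sample schemas (including the INTEGER type shown for the May table) are hypothetical, not taken from the real dataset.

```python
from collections import defaultdict
from typing import Dict, Set


def find_type_mismatches(
    schemas: Dict[str, Dict[str, str]]
) -> Dict[str, Dict[str, Set[str]]]:
    """Return fields whose type differs between tables.

    `schemas` maps table name -> {field name: type name}.
    The result maps field name -> {type name: set of tables using it},
    and only includes fields that have more than one type.
    """
    by_field = defaultdict(lambda: defaultdict(set))
    for table, fields in schemas.items():
        for name, field_type in fields.items():
            by_field[name][field_type].add(table)
    return {
        name: dict(types)
        for name, types in by_field.items()
        if len(types) > 1
    }


# Hypothetical schemas for illustration only.
schemas = {
    "2019_05_01_desktop": {"pageid": "INTEGER", "_gzip_save": "INTEGER"},
    "2019_06_01_desktop": {"pageid": "INTEGER", "_gzip_save": "STRING"},
}
for field, types in find_type_mismatches(schemas).items():
    print(field, {t: sorted(tables) for t, tables in types.items()})
```

Any field reported here would need to be converted to a single type (or cast in the query) before a wildcard query over those tables can read it consistently.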

Member

rviscomi commented Jul 22, 2019

A few other tables have this type mismatch due to HTTPArchive/httparchive.org#135, so it will take a bit more work to get the scheduled query running.

Instead I'll update the Dataflow pipeline to handle the copying of the latest tables. The July crawl is critical so I'll wait until that's done to make the changes.

For now I've manually copied the 2019_06_01 summary_requests tables into the latest dataset, so your queries should be working now.

Author

foolip commented Jul 29, 2019

Thanks @rviscomi! Looking forward to the July data :)

Member

rviscomi commented Jun 3, 2020

Latest summary_requests should be generated properly thanks to HTTPArchive/httparchive.org#203.

I'll create a new issue to track handling latest table creation from the Dataflow pipeline.
