Investigate 3x increase in response body rows as of 2021_07_01 #124
Comments
Won't this process all rows but send null rows for the ones that don't match, since bigquery/dataflow/python/bigquery_import.py Lines 366 to 380 in 9f5d97f
Compare this for Lighthouse where it only runs for mobile: bigquery/dataflow/python/bigquery_import.py Lines 390 to 398 in 9f5d97f
|
Actually I think something else is going on here. It looks like this table used to return only text rows, but now returns all rows. For example, this:
SELECT COUNT(1)
FROM `httparchive.response_bodies.2020_07_01_desktop`
WHERE url LIKE '%.jpg'
Returns 63,989 rows for 2020_07_01 (the 404s maybe?), and 78,429,806 for 2021_07_01. Similarly for fonts:
SELECT COUNT(1)
FROM `httparchive.response_bodies.2020_07_01_desktop`
WHERE url LIKE '%.woff'
We also seem to be including the WOFF bodies (explaining the growth in TB?), but not the JPG bodies? We shouldn't be including binary bodies at all. |
Good find. For example in the response_bodies for almanac.httparchive.org I'm seeing URLs like https://almanac.httparchive.org/static/fonts/Lato-Bold.woff2 and https://almanac.httparchive.org/static/images/home-hero.png. @pmeenan is this a WPT bug? |
Probably a Chrome change that changed WPT's text-only filtering. Looking now. |
Hmm, I'm having trouble reproducing it with almanac.httparchive.org. Any chance I can get a few pages that included WOFF bodies? Wonder if maybe there's some sort of interaction with WPT and some of the new custom metrics in case any of them are doing fetches (I'll triple-check to make sure WPT doesn't grab bodies outside of the actual test) |
Sorry, I meant pages other than the almanac that included woff or jpeg bodies. I'll see if I can write up a query. |
This query returned WOFF fonts with bodies:
SELECT *
FROM `httparchive.response_bodies.2021_07_01_desktop`
WHERE url LIKE '%.woff'
Can add a Weirdly when I ran the same for |
Here's an example: https://webpagetest.httparchive.org/result/210718_Dx12_23SR/1/details/#waterfall_view_step1 It's also not repeatable in regular WPT, but only one of the fonts was captured, so it could be intermittent? Then again, the same font body was also captured for the mobile HTTP Archive run: https://webpagetest.httparchive.org/result/210715_MxAT_N42X/1/details/#waterfall_view_step1 (request 49). Interestingly, the waterfall is completely different between desktop and mobile but we still saw the issue. |
I think I understand the difference. In the old Java pipeline it omitted responses that had no body: bigquery/dataflow/java/src/main/java/com/httparchive/dataflow/BigQueryImport.java Line 329 in 9f5d97f
In the new Python pipeline, anything without a body defaults to the empty string: bigquery/dataflow/python/bigquery_import.py Line 163 in 9f5d97f
So a potential fix would be something like this:
body = request.get('response').get('content').get('text', None)
if body is None:
    continue
We could clean up the BQ tables by deleting any row that has |
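For illustration, here is a minimal, self-contained sketch of that idea. It is not the actual pipeline code: the function name `iter_response_bodies` and the input shape (a list of HAR-style entry dicts with `request.url` and `response.content.text`) are assumptions made for the example.

```python
def iter_response_bodies(har_entries):
    """Yield (url, body) pairs, skipping entries that have no body text.

    Hypothetical helper: assumes HAR-style dicts, not the real pipeline input.
    """
    for entry in har_entries:
        # Tolerate missing or null intermediate keys instead of raising.
        content = (entry.get('response') or {}).get('content') or {}
        body = content.get('text')
        if body is None:
            # No body was captured for this response, so emit no row at all
            # rather than a row with an empty string.
            continue
        url = (entry.get('request') or {}).get('url')
        yield url, body
```

With this shape, a response whose `content` has no `text` key produces no row, while a response with captured text is passed through unchanged.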
Strange, the font example above actually comes back from chrome as a utf8 string and there is no content type on the response. I can exclude it by extension but I think a better way may be to use the 'sec-fetch-dest' request header to not store anything that is requested as a font, image, video, etc |
Just rolled out the filtering to use the Sec-Fetch-Dest request header as an additional filter to keep images, fonts and video data out of the bodies. |
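A rough sketch of what a Sec-Fetch-Dest check could look like, for readers following along. This is not the WPT change itself: `should_store_body` and `BINARY_DESTINATIONS` are hypothetical names, the headers are assumed to arrive as a plain dict, and the set only covers the three destinations named above.

```python
# Destinations named in this thread; the real filter may cover more.
BINARY_DESTINATIONS = frozenset({'font', 'image', 'video'})


def should_store_body(request_headers):
    """Return False when the request was made for a font, image or video.

    Hypothetical helper: header names are matched case-insensitively, and a
    request without a Sec-Fetch-Dest header is kept.
    """
    for name, value in request_headers.items():
        if name.lower() == 'sec-fetch-dest':
            return value.strip().lower() not in BINARY_DESTINATIONS
    return True
```

Because this keys off how the resource was requested rather than the response content type, it would also catch cases like the font above that came back with no content type.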
Regenerating the July 2021 tables using the new pipeline code. The mobile table is running now and will be ready in ~17 hours. The desktop table will be another day. |
Now that we've got the first response_bodies data in several months, it's strange to see a steep increase in the number of rows per table despite the table size (TB) not growing by as much: https://datastudio.google.com/u/0/reporting/1jh_ScPlCIbSYTf2r2Y6EftqmX9SQy4Gn/page/5ike
Investigate the cause of the increased rows and deduplicate if needed. This table will be used by the 2021 Web Almanac, so it's important to make sure it doesn't introduce any data errors.
A couple of theories to start on: