Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate 3x increase in response body rows as of 2021_07_01 #124

Closed
rviscomi opened this issue Jul 28, 2021 · 13 comments · Fixed by #125
Closed

Investigate 3x increase in response body rows as of 2021_07_01 #124

rviscomi opened this issue Jul 28, 2021 · 13 comments · Fixed by #125
Assignees

Comments

@rviscomi
Copy link
Member

Now that we've got the first response_bodies data in several months, it's strange to see a steep increase in the number of rows per table despite the table size (TB) not growing by as much: https://datastudio.google.com/u/0/reporting/1jh_ScPlCIbSYTf2r2Y6EftqmX9SQy4Gn/page/5ike

image

Investigate the cause of the increased rows and deduplicate if needed. This table will be used by the 2021 Web Almanac, so it's important to make sure it doesn't introduce any data errors.

A couple of theories to start on:

  • Bisecting the HARs results in some null rows
  • Bisecting the HARs results in some duplicate rows
@tunetheweb
Copy link
Member

Won't this process all rows but send null rows for the ones that don't match, since get_response_bodies_a returns null for half the rows (and similarly for get_response_bodies_b)?

(hars
| 'MapResponseBodiesA' >> beam.FlatMap(get_response_bodies_a)
| 'WriteResponseBodiesA' >> beam.io.WriteToBigQuery(
get_bigquery_uri(known_args.input, 'response_bodies'),
schema='page:STRING, url:STRING, body:STRING, truncated:BOOLEAN',
write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED))
(hars
| 'MapResponseBodiesB' >> beam.FlatMap(get_response_bodies_b)
| 'WriteResponseBodiesB' >> beam.io.WriteToBigQuery(
get_bigquery_uri(known_args.input, 'response_bodies'),
schema='page:STRING, url:STRING, body:STRING, truncated:BOOLEAN',
write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED))

Compare this for Lighthouse where it only runs for mobile:

# Skip Lighthouse for desktop HARs.
if known_args.input.startswith('android'):
(hars
| 'MapLighthouseReports' >> beam.FlatMap(get_lighthouse_reports)
| 'WriteLighthouseReports' >> beam.io.WriteToBigQuery(
get_bigquery_uri(known_args.input, 'lighthouse'),
schema='url:STRING, report:STRING',
write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED))

@tunetheweb
Copy link
Member

Actually I think something else is going on here. It looks like it used to only return text rows in this table, but now returns all rows.

For example this:

SELECT COUNT(1)
FROM `httparchive.response_bodies.2020_07_01_desktop`
WHERE url LIKE '%.jpg'

Returns 63,989 rows for 2020_07_01 (the 404s maybe?), and 78,429,806 for 2021_07_01.

Similarly for fonts:

SELECT COUNT(1)
FROM `httparchive.response_bodies.2020_07_01_desktop`
WHERE url LIKE '%.woff'

Also seem to be including the WOFF bodies (explaining the growth in TB?), but not the JPG? We shouldn't be including binary bodies at all.

@rviscomi
Copy link
Member Author

Good find. For example in the response_bodies for almanac.httparchive.org I'm seeing URLs like https://almanac.httparchive.org/static/fonts/Lato-Bold.woff2 and https://almanac.httparchive.org/static/images/home-hero.png. @pmeenan is this a WPT bug?

@pmeenan
Copy link
Member

pmeenan commented Jul 28, 2021

Probably a Chrome change that changed WPT's text-only filtering. Looking now.

@pmeenan
Copy link
Member

pmeenan commented Jul 28, 2021

Hmm, I'm having trouble reproducing it with almanac.httparchive.org. Any chance I can get a few pages that included WOFF bodies?

Wonder if maybe there's some sort of interaction with WPT and some of the new custom metrics in case any of them are doing fetches (I'll triple-check to make sure WPT doesn't grab bodies outside of the actual test)

@rviscomi
Copy link
Member Author

SELECT * FROM `httparchive.response_bodies.2021_07_01_desktop` WHERE page = 'https://almanac.httparchive.org/'

Be aware this processes 15 TB.

page url
https://almanac.httparchive.org/ https://almanac.httparchive.org/
https://almanac.httparchive.org/ https://almanac.httparchive.org/en/2020/
https://almanac.httparchive.org/ https://almanac.httparchive.org/static/css/normalize.css?v=112272e51c80ffe5bd01becd2ce7d656
https://almanac.httparchive.org/ https://almanac.httparchive.org/static/css/almanac.css?v=98a5dc5255545dca5c3f3826dd3567bd
https://almanac.httparchive.org/ https://almanac.httparchive.org/static/css/index.css?v=dd16269b4c1c3af01e8b635f00c16f69
https://almanac.httparchive.org/ https://almanac.httparchive.org/static/js/almanac.js?v=c8c72daa5804a31cf1a59084b8745691
https://almanac.httparchive.org/ https://almanac.httparchive.org/static/js/web-vitals.js?v=12282d9f0577af53119267cc8d81dc64
https://almanac.httparchive.org/ https://almanac.httparchive.org/static/js/send-web-vitals.js?v=f176ee2628e8a2a549a6f5f3e122ee22
https://almanac.httparchive.org/ https://www.googletagmanager.com/gtag/js?id=UA-22381566-3
https://almanac.httparchive.org/ https://www.google-analytics.com/plugins/ua/linkid.js
https://almanac.httparchive.org/ https://www.google-analytics.com/j/collect?v=1&_v=j91&a=536456327&t=pageview&_s=1&dl=https%3A%2F%2Falmanac.httparchive.org%2Fen%2F2020%2F&ul=en-us&de=UTF-8&dt=The%202020%20Web%20Almanac&sd=24-bit&sr=1920x1200&vp=1351x696&je=0&_u=aGBAAUIhAAAAAC~&jid=904247328&gjid=386119546&cid=1745670297.1626329307&tid=UA-22381566-3&_gid=882409207.1626329307&_r=1&gtm=2ou7e0&tc=x&z=1146007784
https://almanac.httparchive.org/ https://almanac.httparchive.org/static/fonts/Poppins-Light.woff2
https://almanac.httparchive.org/ https://almanac.httparchive.org/static/fonts/Lato-Regular.woff2
https://almanac.httparchive.org/ https://almanac.httparchive.org/static/fonts/Poppins-Bold.woff2
https://almanac.httparchive.org/ https://almanac.httparchive.org/static/fonts/Lato-Black.woff2
https://almanac.httparchive.org/ https://almanac.httparchive.org/static/fonts/Lato-Bold.woff2
https://almanac.httparchive.org/ https://almanac.httparchive.org/static/images/home-hero.png
https://almanac.httparchive.org/ https://almanac.httparchive.org/static/css/page.css?v=ca124dcff4fd4fa1c5edb0e23b77f839
https://almanac.httparchive.org/ https://almanac.httparchive.org/static/images/character-markup.png
https://almanac.httparchive.org/ https://almanac.httparchive.org/static/images/character-star.png
https://almanac.httparchive.org/ https://almanac.httparchive.org/static/images/character-hat.png
https://almanac.httparchive.org/ https://almanac.httparchive.org/static/fonts/Lato-Italic.woff2
https://almanac.httparchive.org/ https://www.google-analytics.com/collect?v=1&_v=j91&a=536456327&t=event&ni=1&_s=2&dl=https%3A%2F%2Falmanac.httparchive.org%2Fen%2F2020%2F&ul=en-us&de=UTF-8&dt=The%202020%20Web%20Almanac&sd=24-bit&sr=1920x1200&vp=1351x696&je=0&ec=Web%20Vitals&ea=FCP&el=v2-1626329306589-8113144612587&ev=1022&_u=aGBAAUIhAAAAAC~&jid=&gjid=&cid=1745670297.1626329307&tid=UA-22381566-3&_gid=882409207.1626329307&gtm=2ou7e0&tc=x&z=1268143968
https://almanac.httparchive.org/ https://www.google-analytics.com/collect?v=1&_v=j91&a=536456327&t=event&_s=3&dl=https%3A%2F%2Falmanac.httparchive.org%2Fen%2F2020%2F&ul=en-us&de=UTF-8&dt=The%202020%20Web%20Almanac&sd=24-bit&sr=1920x1200&vp=1351x696&je=0&ec=user&ea=print-mode&el=false&ev=0&_u=aGBAAUIhAAAAAC~&jid=&gjid=&cid=1745670297.1626329307&tid=UA-22381566-3&_gid=882409207.1626329307&gtm=2ou7e0&tc=x&z=1992814082
https://almanac.httparchive.org/ https://www.google-analytics.com/collect?v=1&_v=j91&a=536456327&t=event&_s=4&dl=https%3A%2F%2Falmanac.httparchive.org%2Fen%2F2020%2F&ul=en-us&de=UTF-8&dt=The%202020%20Web%20Almanac&sd=24-bit&sr=1920x1200&vp=1351x696&je=0&ec=user&ea=min-sheets-width&el=true&ev=1&_u=aGBAAUIhAAAAAC~&jid=&gjid=&cid=1745670297.1626329307&tid=UA-22381566-3&_gid=882409207.1626329307&gtm=2ou7e0&tc=x&z=1661527953
https://almanac.httparchive.org/ https://www.google-analytics.com/collect?v=1&_v=j91&a=536456327&t=event&_s=5&dl=https%3A%2F%2Falmanac.httparchive.org%2Fen%2F2020%2F&ul=en-us&de=UTF-8&dt=The%202020%20Web%20Almanac&sd=24-bit&sr=1920x1200&vp=1351x696&je=0&ec=user&ea=data-saver&el=not-enabled&ev=0&_u=aGBAAUIhAAAAAC~&jid=&gjid=&cid=1745670297.1626329307&tid=UA-22381566-3&_gid=882409207.1626329307&gtm=2ou7e0&tc=x&z=1725817895
https://almanac.httparchive.org/ https://www.google-analytics.com/collect?v=1&_v=j91&a=536456327&t=event&_s=6&dl=https%3A%2F%2Falmanac.httparchive.org%2Fen%2F2020%2F&ul=en-us&de=UTF-8&dt=The%202020%20Web%20Almanac&sd=24-bit&sr=1920x1200&vp=1351x696&je=0&ec=user&ea=connection-type&el=4g&ev=1&_u=aGBAAUIhAAAAAC~&jid=&gjid=&cid=1745670297.1626329307&tid=UA-22381566-3&_gid=882409207.1626329307&gtm=2ou7e0&tc=x&z=914507272
https://almanac.httparchive.org/ https://www.google-analytics.com/collect?v=1&_v=j91&a=536456327&t=event&_s=7&dl=https%3A%2F%2Falmanac.httparchive.org%2Fen%2F2020%2F&ul=en-us&de=UTF-8&dt=The%202020%20Web%20Almanac&sd=24-bit&sr=1920x1200&vp=1351x696&je=0&ec=user&ea=hi-res-canvas&el=supported&ev=1&_u=aGBAAUIhAAAAAC~&jid=&gjid=&cid=1745670297.1626329307&tid=UA-22381566-3&_gid=882409207.1626329307&gtm=2ou7e0&tc=x&z=1818432849
https://almanac.httparchive.org/ https://www.google-analytics.com/collect?v=1&_v=j91&a=536456327&t=event&ni=1&_s=8&dl=https%3A%2F%2Falmanac.httparchive.org%2Fen%2F2020%2F&ul=en-us&de=UTF-8&dt=The%202020%20Web%20Almanac&sd=24-bit&sr=1920x1200&vp=1351x696&je=0&ec=Web%20Vitals&ea=TTFB&el=v2-1626329306590-9863686315487&ev=633&_u=aGBAAUIhAAAAAC~&jid=&gjid=&cid=1745670297.1626329307&tid=UA-22381566-3&_gid=882409207.1626329307&gtm=2ou7e0&tc=x&z=1635327878
https://almanac.httparchive.org/ https://almanac.httparchive.org/static/images/favicon.ico
https://almanac.httparchive.org/ https://www.google-analytics.com/analytics.js

@pmeenan
Copy link
Member

pmeenan commented Jul 28, 2021

Sorry, I meant pages other than the almanac that included woff or jpeg bodies. I'll see if I can write up a query.

@tunetheweb
Copy link
Member

This query returned WOFF fonts with bodies:

SELECT *
FROM `httparchive.response_bodies.2021_07_01_desktop`
WHERE url LIKE '%.woff'

Can add a AND body IS NOT NULL at end if you want.

Weirdly when I ran the same for .jpg I got rows (which I shouldn’t) but the body column was empty (which is good at least), while for .woff it looked like binary WOFF data was in the body column.

@tunetheweb
Copy link
Member

Here's an example: https://webpagetest.httparchive.org/result/210718_Dx12_23SR/1/details/#waterfall_view_step1

image

Also not repeatable in regular WPT, but then again only one of the fonts was captured so it could be intermittent? Then again, the same font body was also captured for Mobile HTTP Archive run: https://webpagetest.httparchive.org/result/210715_MxAT_N42X/1/details/#waterfall_view_step1 (request 49). Interestingly the waterfall is completely different between desktop and mobile but we still saw the issue.

@rviscomi
Copy link
Member Author

I think I understand the difference. In the old Java pipeline it omitted responses that had no body:

if (content != null && content.has("text")) {

In the new Python pipeline, anything without a body defaults to the empty string:

body = request.get('response').get('content').get('text', '')

So a potential fix would be something like this:

    body = request.get('response').get('content').get('text', None)

    if body == None:
      continue

We could clean up the BQ tables by deleting any row that has body='' although that might delete legitimate response bodies that exist but are empty.

@pmeenan
Copy link
Member

pmeenan commented Jul 30, 2021

Strange, the font example above actually comes back from chrome as a utf8 string and there is no content type on the response. I can exclude it by extension but I think a better way may be to use the 'sec-fetch-dest' request header to not store anything that is requested as a font, image, video, etc

@pmeenan
Copy link
Member

pmeenan commented Jul 30, 2021

Just rolled out the filtering to use the Sec-Fetch-Dest request header as an additional filter to keep images, fonts and video data out of the bodies.

@rviscomi
Copy link
Member Author

rviscomi commented Aug 2, 2021

Regenerating the July 2021 tables using the new pipeline code. The mobile table is running now and will be ready in ~17 hours. The desktop table will be another day.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants