Investigate 3x increase in response body rows as of 2021_07_01 #124

rviscomi · 2021-07-28T05:19:42Z

Now that we've got the first response_bodies data in several months, it's strange to see a steep increase in the number of rows per table despite the table size (TB) not growing by as much: https://datastudio.google.com/u/0/reporting/1jh_ScPlCIbSYTf2r2Y6EftqmX9SQy4Gn/page/5ike

Investigate the cause of the increased rows and deduplicate if needed. This table will be used by the 2021 Web Almanac, so it's important to make sure it doesn't introduce any data errors.

A couple of theories to start on:

Bisecting the HARs results in some null rows
Bisecting the HARs results in some duplicate rows


tunetheweb · 2021-07-28T05:35:21Z

Won't this process all rows but send null rows for the ones that don't match, since get_response_bodies_a returns null for half the rows (and similarly for get_response_bodies_b)?

bigquery/dataflow/python/bigquery_import.py

Lines 366 to 380 in 9f5d97f

    
    (hars
      | 'MapResponseBodiesA' >> beam.FlatMap(get_response_bodies_a)
      | 'WriteResponseBodiesA' >> beam.io.WriteToBigQuery(
        get_bigquery_uri(known_args.input, 'response_bodies'),
        schema='page:STRING, url:STRING, body:STRING, truncated:BOOLEAN',
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED))

    (hars
      | 'MapResponseBodiesB' >> beam.FlatMap(get_response_bodies_b)
      | 'WriteResponseBodiesB' >> beam.io.WriteToBigQuery(
        get_bigquery_uri(known_args.input, 'response_bodies'),
        schema='page:STRING, url:STRING, body:STRING, truncated:BOOLEAN',
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED))

Compare this for Lighthouse where it only runs for mobile:

bigquery/dataflow/python/bigquery_import.py

Lines 390 to 398 in 9f5d97f

    
    # Skip Lighthouse for desktop HARs.
    if known_args.input.startswith('android'):
      (hars
        | 'MapLighthouseReports' >> beam.FlatMap(get_lighthouse_reports)
        | 'WriteLighthouseReports' >> beam.io.WriteToBigQuery(
          get_bigquery_uri(known_args.input, 'lighthouse'),
          schema='url:STRING, report:STRING',
          write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
          create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED))
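
To make the null-row question concrete, here is a minimal sketch in plain Python (not Beam) of how a FlatMap-style step behaves when the mapped function yields nothing for non-matching inputs, versus when it emits a placeholder row. Everything here is an assumption for illustration: in_partition_a, bodies_skip, bodies_placeholder, flat_map, the checksum-based split, and the shape of the har dicts are hypothetical and are not the actual get_response_bodies_a logic.

```python
import zlib
from itertools import chain


def in_partition_a(page_url):
    # Hypothetical split: a stable checksum assigns each page to "A" or "B".
    return zlib.crc32(page_url.encode('utf-8')) % 2 == 0


def bodies_skip(har):
    # Yields nothing for pages outside the partition, so no rows are written.
    if not in_partition_a(har['page']):
        return
    for url, body in har['bodies']:
        yield {'page': har['page'], 'url': url, 'body': body}


def bodies_placeholder(har):
    # Emits one all-null row for pages outside the partition.
    if not in_partition_a(har['page']):
        yield {'page': None, 'url': None, 'body': None}
        return
    for url, body in har['bodies']:
        yield {'page': har['page'], 'url': url, 'body': body}


def flat_map(fn, items):
    # Flattens the per-item outputs into one list, as a FlatMap-style step would.
    return list(chain.from_iterable(fn(item) for item in items))
```

If the real functions behave like bodies_skip, bisecting should not add rows; only something like bodies_placeholder would produce null rows for the other half.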

tunetheweb · 2021-07-28T07:09:02Z

Actually I think something else is going on here. It looks like it used to only return text rows in this table, but now returns all rows.

For example this:

SELECT COUNT(1)
FROM `httparchive.response_bodies.2020_07_01_desktop`
WHERE url LIKE '%.jpg'

Returns 63,989 rows for 2020_07_01 (the 404s maybe?), and 78,429,806 for 2021_07_01.

Similarly for fonts:

SELECT COUNT(1)
FROM `httparchive.response_bodies.2020_07_01_desktop`
WHERE url LIKE '%.woff'

It also seems to be including the WOFF bodies (which might explain the growth in TB?), but not the JPG bodies. We shouldn't be including binary bodies at all.

rviscomi · 2021-07-28T15:58:39Z

Good find. For example in the response_bodies for almanac.httparchive.org I'm seeing URLs like https://almanac.httparchive.org/static/fonts/Lato-Bold.woff2 and https://almanac.httparchive.org/static/images/home-hero.png. @pmeenan is this a WPT bug?

pmeenan · 2021-07-28T17:07:45Z

Probably a Chrome change that changed WPT's text-only filtering. Looking now.

pmeenan · 2021-07-28T17:20:23Z

Hmm, I'm having trouble reproducing it with almanac.httparchive.org. Any chance I can get a few pages that included WOFF bodies?

Wonder if maybe there's some sort of interaction with WPT and some of the new custom metrics in case any of them are doing fetches (I'll triple-check to make sure WPT doesn't grab bodies outside of the actual test)

rviscomi · 2021-07-28T17:22:30Z

SELECT * FROM `httparchive.response_bodies.2021_07_01_desktop` WHERE page = 'https://almanac.httparchive.org/'

Be aware this processes 15 TB.

pmeenan · 2021-07-28T17:31:30Z

Sorry, I meant pages other than the almanac that included woff or jpeg bodies. I'll see if I can write up a query.

tunetheweb · 2021-07-28T17:40:37Z

This query returned WOFF fonts with bodies:

SELECT *
FROM `httparchive.response_bodies.2021_07_01_desktop`
WHERE url LIKE '%.woff'

You can add an AND body IS NOT NULL clause at the end if you want.

Weirdly when I ran the same for .jpg I got rows (which I shouldn’t) but the body column was empty (which is good at least), while for .woff it looked like binary WOFF data was in the body column.

tunetheweb · 2021-07-29T06:52:43Z

Here's an example: https://webpagetest.httparchive.org/result/210718_Dx12_23SR/1/details/#waterfall_view_step1

It's also not repeatable in regular WPT, but only one of the fonts was captured, so it could be intermittent. On the other hand, the same font body was also captured in the mobile HTTP Archive run: https://webpagetest.httparchive.org/result/210715_MxAT_N42X/1/details/#waterfall_view_step1 (request 49). Interestingly, the waterfall is completely different between desktop and mobile, but we still saw the issue.

rviscomi · 2021-07-29T23:10:57Z

I think I understand the difference. In the old Java pipeline it omitted responses that had no body:

bigquery/dataflow/java/src/main/java/com/httparchive/dataflow/BigQueryImport.java

Line 329 in 9f5d97f

if (content != null && content.has("text")) {

In the new Python pipeline, anything without a body defaults to the empty string:

bigquery/dataflow/python/bigquery_import.py

Line 163 in 9f5d97f

body = request.get('response').get('content').get('text', '')

So a potential fix would be something like this:

    body = request.get('response').get('content').get('text', None)

    # Skip responses that have no body at all.
    if body is None:
      continue

We could clean up the BQ tables by deleting any row that has body='' although that might delete legitimate response bodies that exist but are empty.
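
As a rough, self-contained sketch of that idea: the function below skips entries with no captured body while keeping bodies that exist but are empty. The name get_response_bodies, its arguments, and the shape of the HAR entries are assumptions for illustration, not the pipeline's actual code.

```python
def get_response_bodies(page_url, entries):
    """Yield one row per HAR entry that has a captured response body."""
    for entry in entries:
        # Tolerate entries that lack a response or content object.
        content = (entry.get('response') or {}).get('content') or {}
        body = content.get('text')
        # No 'text' means no body was captured, so omit the row entirely.
        if body is None:
            continue
        # An empty string is a legitimate (empty) body and is kept.
        yield {
            'page': page_url,
            'url': (entry.get('request') or {}).get('url'),
            'body': body,
        }
```

Distinguishing None from '' at import time avoids the ambiguity that makes cleaning up with body='' after the fact risky.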

pmeenan · 2021-07-30T13:30:12Z

Strange, the font example above actually comes back from Chrome as a UTF-8 string and there is no content type on the response. I can exclude it by extension, but I think a better way may be to use the 'sec-fetch-dest' request header to not store anything that is requested as a font, image, video, etc.

pmeenan · 2021-07-30T13:57:57Z

Just rolled out the filtering to use the Sec-Fetch-Dest request header as an additional filter to keep images, fonts and video data out of the bodies.
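
For readers unfamiliar with the header, a filter along these lines could look like the sketch below. This is not WPT's implementation: should_store_body and BINARY_DESTINATIONS are hypothetical names, the headers are assumed to arrive as a plain dict, and including 'audio' in the set is an assumption beyond the destinations named above.

```python
# Destinations treated as binary; 'audio' is an assumption added here.
BINARY_DESTINATIONS = {'image', 'font', 'video', 'audio'}


def should_store_body(request_headers):
    """Return False when Sec-Fetch-Dest marks the request as binary media."""
    for name, value in request_headers.items():
        # Header names are case-insensitive, so compare in lowercase.
        if name.lower() == 'sec-fetch-dest':
            return value.strip().lower() not in BINARY_DESTINATIONS
    # Header absent: leave the decision to any other filters in place.
    return True
```

Because the header describes how the resource was requested, this kind of check still works when the response carries no content type.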

rviscomi · 2021-08-02T03:36:46Z

Regenerating the July 2021 tables using the new pipeline code. The mobile table is running now and will be ready in ~17 hours. The desktop table will be another day.

rviscomi assigned rviscomi and paulcalvano Jul 28, 2021

rviscomi mentioned this issue Aug 2, 2021

Omit null response bodies #125

Merged

rviscomi closed this as completed in #125 Aug 2, 2021
