Ensure 100% table coverage in BigQuery #36

Closed
rviscomi opened this issue Apr 3, 2018 · 6 comments

@rviscomi
Member

rviscomi commented Apr 3, 2018

https://discuss.httparchive.org/t/missing-2016-02-15-chrome-requests/1310 is a bug report that some 2016_02_15 tables are missing.

We should take inventory of all tables across all dates and reprocess anything that's missing.

This can be a good first bug for first-time contributors. Overview of the expected workflow:

  • use the bq command line interface to list the contents of each dataset
  • export results to a spreadsheet
    • graph the results to make it obvious if there are any gaps
  • or write a script to check if any YYYY_MM_[01, 15] tables are missing
    • some early tables are not necessarily DD=[01, 15]
  • ignore tables that are expected to be missing, e.g. lighthouse.YYYY_MM_DD_desktop, or others missing as a result of known data loss bugs (citation needed)
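
The "write a script" option above could be sketched roughly as follows. This is a minimal illustration, not the project's actual tooling: it assumes table IDs follow the YYYY_MM_DD_<suffix> pattern, that crawls land on the 1st and 15th, and that the list of table IDs has already been pulled from the bq CLI. The function names and the `ignore` parameter are hypothetical.

```python
from datetime import date


def expected_crawl_dates(start, end, days=(1, 15)):
    """Yield each date in [start, end] that falls on one of the given days of the month."""
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        for day in days:
            d = date(year, month, day)
            if start <= d <= end:
                yield d
        month += 1
        if month > 12:
            year, month = year + 1, 1


def find_missing(table_ids, start, end, suffixes=("desktop", "mobile"), ignore=()):
    """Return the expected table IDs that are absent from table_ids, in date order.

    ignore: table IDs that are known or expected to be missing and should not be reported.
    """
    present = set(table_ids)
    skipped = set(ignore)
    missing = []
    for d in expected_crawl_dates(start, end):
        for suffix in suffixes:
            name = f"{d:%Y_%m_%d}_{suffix}"
            if name not in present and name not in skipped:
                missing.append(name)
    return missing


if __name__ == "__main__":
    # Made-up inventory for illustration only.
    tables = ["2016_02_01_desktop", "2016_02_01_mobile", "2016_02_15_mobile"]
    print(find_missing(tables, date(2016, 2, 1), date(2016, 2, 15)))
```

For a dataset like lighthouse, where desktop tables are expected to be absent, passing suffixes=("mobile",) avoids false positives. Early crawls that did not run on the 1st or 15th would need a different `days` value or a manual check.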
@paulcalvano

I’ll work on this one.

@rviscomi
Member Author

Thanks Paul!

@paulcalvano

paulcalvano commented May 1, 2018

I'm using the following query to extract all of this data. I'm looking at row counts by table name in each dataset:

-- One row of metadata per table in each dataset (includes table_id and row_count)
SELECT * FROM httparchive.har.__TABLES__
UNION ALL
SELECT * FROM httparchive.lighthouse.__TABLES__
UNION ALL
SELECT * FROM httparchive.pages.__TABLES__
UNION ALL
SELECT * FROM httparchive.requests.__TABLES__
UNION ALL
SELECT * FROM httparchive.response_bodies.__TABLES__
UNION ALL
SELECT * FROM httparchive.runs.__TABLES__
UNION ALL
SELECT * FROM httparchive.summary_pages.__TABLES__
UNION ALL
SELECT * FROM httparchive.summary_requests.__TABLES__
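
If the list of datasets changes, the same query can be generated instead of edited by hand. A small sketch, assuming only the dataset names used in the query above; the helper name `build_inventory_query` is made up for illustration:

```python
# Dataset names taken from the query above.
DATASETS = [
    "har", "lighthouse", "pages", "requests",
    "response_bodies", "runs", "summary_pages", "summary_requests",
]


def build_inventory_query(datasets, project="httparchive"):
    """Build one SELECT over __TABLES__ per dataset and join them with UNION ALL."""
    selects = [f"SELECT * FROM {project}.{dataset}.__TABLES__" for dataset in datasets]
    return "\nUNION ALL\n".join(selects)


if __name__ == "__main__":
    print(build_inventory_query(DATASETS))
```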

Here's a graphical summary of the gaps for all the runs, summary_pages and summary_requests datasets.

image

In case the descriptions are difficult to read, here's a summary of the gaps:

(1) 12/01/2013: missing data for all summary tables
(2) 10/15/2014: requests data table is twice as large as previous months
(3) 01/15/2016: requests data table is missing data
(4) 12/15/2016: requests tables are missing data
    12/15/2016: requests_mobile table is missing
(5) 01/01/2017: missing requests and requests_mobile data
    01/15/2017: missing requests and requests_mobile data
(6) 11/15/2017: missing data in requests_mobile
(7) 02/15/2017-06/15/2017: missing desktop pages summary
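
Gaps like (2) and (3), where a table exists but its size is off, could also be flagged automatically rather than read off a chart. The sketch below compares each table's row count to the median of the few crawls before it. The window size and the 0.5x / 2x thresholds are assumptions chosen for illustration, not values used in this analysis, and the function name is hypothetical.

```python
from statistics import median


def flag_row_count_anomalies(row_counts, window=3, low=0.5, high=2.0):
    """Flag tables whose row count differs sharply from recent crawls.

    row_counts: list of (table_id, row_count) tuples, sorted by crawl date.
    Returns (table_id, ratio) for each table whose count is <= low or >= high
    times the median of up to `window` preceding counts.
    """
    flagged = []
    for i, (table_id, count) in enumerate(row_counts):
        history = [c for _, c in row_counts[max(0, i - window):i]]
        if not history:
            continue  # first table has nothing to compare against
        baseline = median(history)
        if baseline == 0:
            continue  # avoid dividing by zero when recent tables are empty
        ratio = count / baseline
        if ratio <= low or ratio >= high:
            flagged.append((table_id, ratio))
    return flagged
```

Using the median rather than the immediately preceding crawl means a single oversized table does not cause the next normal table to be flagged as a drop.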

I'll work on the har, lighthouse, pages and requests datasets next.

@paulcalvano

paulcalvano commented May 1, 2018

The har, lighthouse, pages and requests datasets do not appear to have any gaps between the old and new datasets. There are a few interesting things to note:

  • Most of this data has major gaps on 10/15/2016 and 11/1/2016
  • Starting around 7/15/2017 and ending by 02/01/2018, there was a drop in mobile response_bodies (request_bodies in the older dataset)

image

@rviscomi
Member Author

rviscomi commented May 1, 2018

This is awesome, thanks for compiling it, Paul. I'll work on rerunning the pipeline for any missing tables.

@rviscomi rviscomi self-assigned this May 1, 2018
@paulcalvano

Cool, thanks. Let me know when it’s done and I can update the visualization with the latest table data to confirm.
