Is COUNT O(n) / storing counts in the R-Tree? #210

thanatos · 2017-08-09T18:39:14Z

This is related to #209, in that it's the same use case / testing.

Using COUNT on, e.g., a WITHIN query, seems to be ~O(n). E.g.,

127.0.0.1:9851> WITHIN addr LIMIT 5000000 WHERE beds 2 2 COUNT TILE 327 791 11
{"ok":true,"count":28758,"cursor":0,"elapsed":"305.070034ms"}
127.0.0.1:9851> WITHIN addr LIMIT 5000000 WHERE beds 2 2 COUNT TILE 163 395 10
{"ok":true,"count":43653,"cursor":0,"elapsed":"837.100618ms"}

(The second query is the containing tile of the first query; you can also see similar behavior in the timing output in the example output on #209.)

I think I was naïvely hoping that the R-Tree had a count of the number of things in each subtree, s.t. if an entire subtree fell within the target rectangle, that count could just be returned directly, instead of needing to iterate through the points in the subtree, which is what the timings suggest to me that Tile38 might be doing. (On the fringes of the input area, one might need to either subdivide further, or just outright iterate and test, but the point being that parts of the operation could potentially be sped up.)

I thought about this for a bit longer, and I don't think it's as simple as I first thought:

storing a count of the number of objects in a subtree doesn't really help if the query includes a WHERE
I'm not certain how storing a count would work for non-points/polygons; could those span an enclosing rectangle (and thus be at risk of being double-counted?) or does the R-Tree structure prevent this?

There's also of course the downside of the additional RAM usage for the counts.

(My use case here, and in #209, is that querying an area with a massive number of points is slow, simply due to the volume of data returned. It generally isn't useful to flood a (in my case, mobile) client w/ 1,000,000+ points, so the thinking was that we would, if the resultset was over a particular size, just return a count instead. (E.g., if <1000 results, get results normally, else get a count.) The client would need to drill in further / narrow their query. The thought was to implement this by first issuing a COUNT, on the hopes that that would be fast, but now, I'm not so certain. We're using WHERE, so even with the above, things are still slightly complicated.)

The text was updated successfully, but these errors were encountered:

tidwall · 2017-08-10T14:31:40Z

Using COUNT on, e.g., a WITHIN query, seems to be ~O(n).

The R-tree is very quick at getting to the target rectangle O(log n). It then will need to iterate over each intersecting child node and finally the objects. The iterating part would be O(n) where n is the number of items intersecting the target rectangle.

I agree that 300ms seems a bit slow for filtering on 1,000,000 objects. I'll investigate and let you know what I find.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Is COUNT O(n) / storing counts in the R-Tree? #210

Is COUNT O(n) / storing counts in the R-Tree? #210

thanatos commented Aug 9, 2017

tidwall commented Aug 10, 2017

Is COUNT O(n) / storing counts in the R-Tree? #210

Is COUNT O(n) / storing counts in the R-Tree? #210

Comments

thanatos commented Aug 9, 2017

tidwall commented Aug 10, 2017