Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Is COUNT O(n) / storing counts in the R-Tree? #210

Open
thanatos opened this issue Aug 9, 2017 · 1 comment
Open

Is COUNT O(n) / storing counts in the R-Tree? #210

thanatos opened this issue Aug 9, 2017 · 1 comment

Comments

@thanatos
Copy link

thanatos commented Aug 9, 2017

This is related to #209, in that it's the same use case / testing.

Using COUNT on, e.g., a WITHIN query, seems to be ~O(n). E.g.,

127.0.0.1:9851> WITHIN addr LIMIT 5000000 WHERE beds 2 2 COUNT TILE 327 791 11
{"ok":true,"count":28758,"cursor":0,"elapsed":"305.070034ms"}
127.0.0.1:9851> WITHIN addr LIMIT 5000000 WHERE beds 2 2 COUNT TILE 163 395 10
{"ok":true,"count":43653,"cursor":0,"elapsed":"837.100618ms"}

(The second query is the containing tile of the first query; you can also see similar behavior in the timing output in the example output on #209.)

I think I was naïvely hoping that the R-Tree had a count of the number of things in each subtree, s.t. if an entire subtree fell within the target rectangle, that count could just be returned directly, instead of needing to iterate through the points in the subtree, which is what the timings suggest to me that Tile38 might be doing. (On the fringes of the input area, one might need to either subdivide further, or just outright iterate and test, but the point being that parts of the operation could potentially be sped up.)

I thought about this for a bit longer, and I don't think it's as simple as I first thought:

  • storing a count of the number of objects in a subtree doesn't really help if the query includes a WHERE
  • I'm not certain how storing a count would work for non-points/polygons; could those span an enclosing rectangle (and thus be at risk of being double-counted?) or does the R-Tree structure prevent this?

There's also of course the downside of the additional RAM usage for the counts.

(My use case here, and in #209, is that querying an area with a massive number of points is slow, simply due to the volume of data returned. It generally isn't useful to flood a (in my case, mobile) client w/ 1,000,000+ points, so the thinking was that we would, if the resultset was over a particular size, just return a count instead. (E.g., if <1000 results, get results normally, else get a count.) The client would need to drill in further / narrow their query. The thought was to implement this by first issuing a COUNT, on the hopes that that would be fast, but now, I'm not so certain. We're using WHERE, so even with the above, things are still slightly complicated.)

@tidwall
Copy link
Owner

tidwall commented Aug 10, 2017

Using COUNT on, e.g., a WITHIN query, seems to be ~O(n).

The R-tree is very quick at getting to the target rectangle O(log n). It then will need to iterate over each intersecting child node and finally the objects. The iterating part would be O(n) where n is the number of items intersecting the target rectangle.

I agree that 300ms seems a bit slow for filtering on 1,000,000 objects. I'll investigate and let you know what I find.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants