This repository has been archived by the owner on Jul 31, 2023. It is now read-only.

Show in-process child spans in zpages #782

Open
mwuertinger opened this issue Jun 6, 2018 · 9 comments

Comments

@mwuertinger

I'm using OpenCensus in a Go HTTP server with the trace.ProbabilitySampler. As far as I understand, the sampling decision is made before request processing starts, so it is currently impossible to influence the decision based on properties of the request outcome (e.g. latency, status code). As I was told on Gitter, this is by design: the sampling decision is usually made in the outermost service and then passed on to all downstream services.

However, it would be helpful if one could influence the sampling decision within an application even after the request has started, in order to collect at least partial information about slow requests.

Is there anything planned in that regard?
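
To illustrate why the outcome cannot influence the decision, here is a minimal sketch of head-based probability sampling. It uses only the standard library; `headSampler`, `newHeadSampler`, and `shouldSample` are hypothetical names for illustration and are not the OpenCensus API.

```go
package main

import (
	"fmt"
	"math/rand"
)

// headSampler is a hypothetical stand-in for a probability sampler.
// It decides once, when the root span starts, whether the trace is recorded.
type headSampler struct {
	fraction float64 // probability of sampling, in [0, 1]
	rng      *rand.Rand
}

// newHeadSampler builds a sampler with a fixed seed so runs are repeatable.
func newHeadSampler(fraction float64, seed int64) *headSampler {
	return &headSampler{fraction: fraction, rng: rand.New(rand.NewSource(seed))}
}

// shouldSample is called before the request is handled, so it has no
// access to the eventual latency or status code.
func (s *headSampler) shouldSample() bool {
	return s.rng.Float64() < s.fraction
}

func main() {
	s := newHeadSampler(0.1, 42)

	// The decision is made up front...
	sampled := s.shouldSample()

	// ...and the outcome (a hypothetical slow request) is only known later.
	latencyMs := 750
	fmt.Printf("sampled=%v latencyMs=%d\n", sampled, latencyMs)
}
```

With a fraction of 0.1, roughly nine out of ten slow requests would go unrecorded, which is the gap this issue describes.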

@semistrict
Contributor

Are you using z-pages? One of the features of the tracez page (although it's pretty rudimentary at the moment) is that you can see some basic info for all spans in a particular latency bucket, even those that were not sampled.
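
As a rough sketch of the latency-bucket idea, the code below keeps a few recent span summaries per bucket regardless of the sampling decision. The bucket bounds, type names, and per-bucket limit are assumptions for illustration; this is not the zpages implementation.

```go
package main

import "fmt"

// bucketBoundsUs are illustrative lower bounds in microseconds
// (assumed values, not taken from the zpages source).
var bucketBoundsUs = []int64{0, 10, 100, 1000, 10000, 100000, 1000000}

// spanSummary is a hypothetical, minimal record of a finished span.
type spanSummary struct {
	name      string
	latencyUs int64
	sampled   bool
}

// latencyBuckets keeps up to perBucket recent summaries for each bucket.
type latencyBuckets struct {
	perBucket int
	buckets   [][]spanSummary
}

func newLatencyBuckets(perBucket int) *latencyBuckets {
	return &latencyBuckets{
		perBucket: perBucket,
		buckets:   make([][]spanSummary, len(bucketBoundsUs)),
	}
}

// bucketIndex returns the index of the largest lower bound <= latencyUs.
func bucketIndex(latencyUs int64) int {
	idx := 0
	for i, bound := range bucketBoundsUs {
		if latencyUs >= bound {
			idx = i
		}
	}
	return idx
}

// record stores the summary whether or not the span was sampled,
// evicting the oldest entry once the bucket is full.
func (l *latencyBuckets) record(s spanSummary) {
	if l.perBucket <= 0 {
		return
	}
	i := bucketIndex(s.latencyUs)
	if len(l.buckets[i]) >= l.perBucket {
		l.buckets[i] = l.buckets[i][1:]
	}
	l.buckets[i] = append(l.buckets[i], s)
}

func main() {
	lb := newLatencyBuckets(2)
	lb.record(spanSummary{name: "GET /fast", latencyUs: 800, sampled: false})
	lb.record(spanSummary{name: "GET /slow", latencyUs: 2500000, sampled: false})
	for i, b := range lb.buckets {
		fmt.Printf(">= %dus: %d span(s)\n", bucketBoundsUs[i], len(b))
	}
}
```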

@mwuertinger
Author

Didn't know that. Will have a look. Thanks :)

@semistrict
Contributor

@mwuertinger what kinds of partial information would be useful?

I can think of a way we could pretty easily retain all child spans of a parent span that had high latency and/or errors. Using this, we could display the parent span representing the HTTP/gRPC request and then all the child spans representing e.g. database calls to service that request. It would only contain spans from the same process, not the full distributed trace. Is this what you had in mind?
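
The retention idea above could look roughly like the following sketch: when a local parent span ends, its whole in-process span tree is kept only if it was slow or failed. All types, fields, and the threshold are hypothetical and do not reflect OpenCensus internals.

```go
package main

import "fmt"

// span is a simplified, hypothetical span record (not the OpenCensus type).
type span struct {
	name      string
	latencyMs int64
	failed    bool
	children  []*span // in-process child spans only
}

// localStore retains span trees for "interesting" parents in memory.
type localStore struct {
	thresholdMs int64 // parents at or above this latency are kept
	maxKept     int   // upper bound on retained trees
	kept        []*span
}

// onParentEnd decides, after the outcome is known, whether to keep the
// parent together with its children. It reports whether the tree was kept.
func (s *localStore) onParentEnd(parent *span) bool {
	if s.maxKept <= 0 {
		return false
	}
	if parent.latencyMs < s.thresholdMs && !parent.failed {
		return false
	}
	if len(s.kept) >= s.maxKept {
		s.kept = s.kept[1:] // evict the oldest tree to bound memory
	}
	s.kept = append(s.kept, parent)
	return true
}

func main() {
	store := &localStore{thresholdMs: 100, maxKept: 10}

	slow := &span{
		name:      "GET /report",
		latencyMs: 250,
		children:  []*span{{name: "db.Query", latencyMs: 240}},
	}
	store.onParentEnd(slow)

	for _, p := range store.kept {
		fmt.Printf("%s (%dms)\n", p.name, p.latencyMs)
		for _, c := range p.children {
			fmt.Printf("  %s (%dms)\n", c.name, c.latencyMs)
		}
	}
}
```

Because the decision is made locally, the retained tree contains only spans from this process, not the full distributed trace.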

@vaijab

vaijab commented Jul 4, 2018

I just wonder how useful sampling based on worst-case scenarios really is. Wouldn't you lose track of what normal looks like?

@mwuertinger
Author

@Ramonza From what I understand, this is the best we can do at the moment. I think it would help in certain situations, but I also have to agree with @vaijab that this would distort the overall picture of service health and might lead to the wrong conclusions. It's probably best to leave decisions like that to the individual teams.

I don't think there is public information available, but I heard some time ago that Google's internal tracing system makes much smarter sampling decisions. Does anybody know more about that?

@g-easy
Contributor

g-easy commented Jul 6, 2018

Yes, doing this would distort the overall picture because the sampling would no longer be uniform. We can mitigate that by annotating traces as "sampled uniformly" versus "sampled because something interesting happened."

The advantage of being able to get traces of slow operations is being able to debug why they were slow.

@semistrict
Contributor

I agree that it's important to know whether something was sampled uniformly or not. This argument also applies to cases where tracing is explicitly requested from the client. Perhaps we should add an attribute that indicates the sampling policy & weight?
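
A sketch of how such an attribute might be used: if every recorded span carries its sampling policy and a weight (assumed here to be 1 divided by the probability that the span was recorded), population-level estimates can still be recovered from a non-uniform sample. The field names and policy strings are hypothetical.

```go
package main

import "fmt"

// sampledSpan is a hypothetical recorded span with sampling metadata.
type sampledSpan struct {
	name      string
	latencyMs float64
	policy    string  // assumed values: "uniform" or "slow_request"
	weight    float64 // number of requests this span represents (1/probability)
}

// weightedMeanLatency estimates the mean latency over all requests,
// weighting each recorded span by how many requests it stands for.
func weightedMeanLatency(spans []sampledSpan) float64 {
	var sum, total float64
	for _, s := range spans {
		sum += s.latencyMs * s.weight
		total += s.weight
	}
	if total == 0 {
		return 0
	}
	return sum / total
}

func main() {
	// Assumed scenario: fast requests recorded with probability 0.01,
	// slow requests always recorded.
	spans := []sampledSpan{
		{name: "GET /", latencyMs: 10, policy: "uniform", weight: 100},
		{name: "GET /", latencyMs: 500, policy: "slow_request", weight: 1},
	}

	naive := (spans[0].latencyMs + spans[1].latencyMs) / 2
	fmt.Printf("unweighted mean: %.1fms\n", naive)
	fmt.Printf("weighted mean:   %.1fms\n", weightedMeanLatency(spans))
}
```

The unweighted mean is dominated by the deliberately captured slow span, while the weighted mean stays close to typical latency, which addresses the concern about losing track of what normal looks like.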

We do already have (in tracez) a way to see spans in each latency bucket. I think adding in-process child spans there might make that a lot more useful. They wouldn't be stored anywhere so wouldn't affect how representative the stored sample is.

@mwuertinger
Author

@Ramonza Adding in-process child spans to tracez sounds like an excellent idea.

semistrict changed the title from "Sampling decision based on request latency" to "Show in-process child spans in zpages" on Jul 11, 2018
@semistrict
Contributor

Repurposing this issue.

rghetia added the P2 label on May 6, 2019