Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Time window in metric name #326

Open
cazeaux opened this issue Jun 1, 2022 · 3 comments
Open

Time window in metric name #326

cazeaux opened this issue Jun 1, 2022 · 3 comments
Assignees

Comments

@cazeaux
Copy link

cazeaux commented Jun 1, 2022

Hello

Sloth has made the choice to add the time window in the metric name, like slo:sli_error:ratio_rate2h.
But when I use sloth to manage a large amount of SLI, with different SLO and time windows, it appears that having the time window complexifies the management of the metrics.

Problem statement

With a large number of SLI/SLO with different time window, we arrive to a situation where some metrices are not applicable to all SLI, and it is impossible to learn from the metrics which can be used or not.

To simplify, consider the case

  • SLI1 with SLO windows 2h + 4h
  • SLI2 with SLO windows 4h + 1d

It will generate 3 metrics:

  • slo:sli_error:ratio_rate2h
  • slo:sli_error:ratio_rate4h
  • slo:sli_error:ratio_rate1d

I have no way to know that SLI1 has only 2h and 4h by looking at the metrics only.
And I have no way to request label values to know the windows available for a given SLI.

Proposition

I believe that the choice has been done based on prometheus rules best practices, but I think that there is a kind of inconsistence.

According to the practice, a metrice should be named level:metric:operations, where the metric is the name of the metric on which the operation is made. So it makes sense to have the time window in the metric name when it is clearly linked to a given metric, when for another metric we could have other time windows.
But for sloth, the metric is always sli_error so we lose part of the practice rationale, as we are mixing a generic part (the metric) and a specific part (the time window).

So I can propose two solutions:

The most simple: simply remove the time window to keep consistent metric names. Because the label sloth_window is already set, we have nothing to add. Furthermore, thanks the the generic name it becomes possible to learn the available windows with label_values(slo:sli_error:ratio_rate{sloth_service="my-service", sloth_slo="my-slo"}, sloth_window) request.

Maybe more complex, but more compliant to the prometheus pracices: we replace sli_error by a name pointing on the SLI itself. Thus by construction the time window in the metric name remain specific to the SLI. As name, we can use the sloth_service + sloth_slo. For example: slo:my_service:my_slo:ratio_rate2h
This solution still requires to add a way to discover the available windows.

What is your opinion on that ?

@slok
Copy link
Owner

slok commented Jun 19, 2022

Hi @cazeaux

Thanks for the issue.

Indeed, I decided to break the best practices of Prometheus recording rules in favor of UX. One of the main problems that I saw in all Prometheus SLO-related implementations was this, so... I decided to break the rule and have an easier-to-use and discovery.

Why it's inconsistent? well, mainly because at the beginning Sloth only had the 30d window support and I thought that would be enough in favor of simplicity, however, after the huge success of Sloth and the different Sloth user use cases, I decided to add support for more windows. And here we are! so there isn't a good reason for it, just historical reasons (and now kind of tech debt).

So, you are right, we should make something about it, but I'm postponing this because this will break all Sloth installations...

The way I wanted to fix this is exactly the one that you propose first (the most simple), so we are aligned in this sense 😄

@slok slok self-assigned this Jun 19, 2022
@cazeaux
Copy link
Author

cazeaux commented Aug 24, 2022

Hello

To avoid breaking existing installations, can we add this as an option (like it was done for the optimized rules) ?

@tokheim
Copy link

tokheim commented Jan 24, 2024

Just want to offer you a partial solution that might help. Metric names are actually treated as labels in prometheus. So you can do an expression like

{__name__=~"slo:sli_error:ratio_rate.*",sloth_service="my-service", sloth_slo="my-slo"}

to pick up all the windows for a given slo.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants