Measure metrics latency #2323

fraillt · 2024-11-22T10:15:44Z

Tests how much time it takes to call .add(value, attribs) under various conditions.
It measures several things: average time, median time (p50), p95 and p99.
The way it works is pretty simple:

call .add(value, attribs) on several threads in the loop and measure each call
in main thread, keep collecting data (delta collect) with some delay between calls.
Here's the results on my machine on main branch.

See results.

*** updates, using 4 threads ***
no attribs
	iter 26.43M
	avg 103.0ns
	p50 57.00ns
	p95 177.0ns
	p99 182.0ns
1 attrib
	iter 8.267M
	avg 422.0ns
	p50 362.0ns
	p95 743.0ns
	p99 1.017μs
9 attribs
	iter 4.919M
	avg 718.0ns
	p50 615.0ns
	p95 854.0ns
	p99 1.026μs
*** inserts, using 4 threads ***
1 attrib
	iter 435.1K
	avg 8.931μs
	p50 958.0ns
	p95 64.16μs
	p99 113.8μs
10 attribs
	iter 119.0K
	avg 33.03μs
	p50 4.959μs
	p95 133.9μs
	p99 205.0μs
*** mix mostly updates (100 attribute-sets), using 4 threads ***
10 attribs
	iter 379.1K
	avg 10.36μs
	p50 1.729μs
	p95 67.73μs
	p99 125.9μs
*** updates, using 36 threads ***
no attribs
	iter 40.32M
	avg 682.0ns
	p50 215.0ns
	p95 228.0ns
	p99 243.0ns
1 attrib
	iter 11.72M
	avg 2.991μs
	p50 797.0ns
	p95 1.539μs
	p99 2.264μs
9 attribs
	iter 8.819M
	avg 3.952μs
	p50 933.0ns
	p95 1.442μs
	p99 1.797μs
*** inserts, using 36 threads ***
1 attrib
	iter 279.6K
	avg 128.2μs
	p50 2.757μs
	p95 792.6μs
	p99 1.596ms
10 attribs
	iter 80.01K
	avg 448.5μs
	p50 130.5μs
	p95 1.858ms
	p99 3.044ms
*** mix mostly updates (100 attribute-sets), using 36 threads ***
10 attribs
	iter 1.249M
	avg 28.57μs
	p50 945.0ns
	p95 13.10μs
	p99 871.9μs

Merge requirement checklist

CONTRIBUTING guidelines followed
Unit tests added/updated (if applicable)
Appropriate CHANGELOG.md files updated for non-trivial, user-facing changes
Changes in public API reviewed (if applicable)

codecov · 2024-11-22T10:19:16Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 79.5%. Comparing base (e0159ad) to head (8898ac5).

Additional details and impacted files

@@          Coverage Diff          @@
##            main   #2323   +/-   ##
=====================================
  Coverage   79.5%   79.5%           
=====================================
  Files        123     123           
  Lines      21492   21492           
=====================================
  Hits       17094   17094           
  Misses      4398    4398

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

cijothomas · 2024-11-23T02:46:26Z

stress/src/metrics_latency.rs

+    // collect threads in short intervals
+    let mut total_count = 0;
+    while threads.iter().any(|t| !t.is_finished()) {
+        // collect agressively so we could measure inserts properly,


trying to understand what this is simulating? Metric collection interval is by default 30 sec or 60 sec in the spec. What is the scenario for collecting every 500 microsec?

Basically idea, is to choose collection interval so it wouldn't affect numbers (p99) but still allow to test updates and inserts (for insert I need to clean it regularly, but not too frequent that updates would become inserts).
I updated code with explanation why this number... it could be different, but it should be good enough so it would work in all cases without affecting p99.
( Sorry for rebase,-force push, first commit was a bit rushed... I wasn't happy with, now it's better).

Sorry, I am not following the scenario being tested - are we expecting that users will run collect() every few microseconds? I don't think that is something we need to consider for performance optimizations.

Also, I opened #2328 to track revisiting the aggressive cleaning up of HashMap.

cijothomas · 2024-11-23T15:36:09Z

stress/src/metrics_latency.rs

+        }));
+    }
+    let mut total_count = 0;
+    while start.elapsed() < Duration::from_secs(1) {


If the goal is to measure the impact of Collect to overall throughput, lets add the existing PeriodicReader with a dummy exporter to the existing stress test. The PeriodicReader interval should be the defaults (60 secs) - or we can shorten to a low number like 5 sec, but definitely not micro-secs!

(I don't expect adding Reader to have any measurable impact to stress test.)

The goal is opposite,- collect should be invisible in the statistics, but keep in mind, that this number is almost arbitrary... and I should probably have better implementation, but for the starters it already shows meaningful numbers :)
So with 300micro-sec... it will clear around 1800-2500times per second (time to collect also adds up).
For p99, I care about 99% of measurements, and can ignore the rest. Which means, that for p99 it will have no effect on updates, if I measure more than 2500*99 =~250K/s, for my case I'm always over several millions, so collecting doesn't have any affect on statistics.
For inserts situation is different... I need to avoid hitting overflow (2k), which means that at most I can reach around 2M (which might be the case with better hardware...), so I basically hope that I clear frequently enough, to not hit 2k (overflow), but not too frequent, so that it will greatly affect p99 (but it probably affects in some scenarios... so this is not ideal).

Probably I should implement some sort of dynamic/adaptive interval, so that it wouldn't clean to frequently (so it wouldn't affect more than 1% of measurements for p99, but not too rarely to hit overflow (2k).
In other words, if I collect after less than 100, then it will affect p99, but if I collect after 2k, then for inserts it might reached overflow.

I could also implement different strategies for each case.

can you elaborate on the scenario being covered? Without understanding the actual scenario, its hard to provide useful feedback.

I update revision a bit. Generally, the purpose of this PR is to simply look how much time it takes for individual calls to measure things under various conditions, that's it :)
There's something to learn too :) For example, I have learned, that even simple update (no-attributes) under high load (lots of threads) can take as little as 60ns, or as much as 80ms, it's more than 1000000 time difference!
I still cannot explain why some times it takes so long... I tried to do "warmup" but, at least on my machine, this is consistently reproducible.

Thanks @fraillt for the great work, as always!
It is very important to ensure that the update time is predictable and never causes thread to get blocked etc. which will defeat the whole point of relying on OpenTelemetry to report metrics (if OTel itself can cause a thread to lose its CPU slice and mess up the duration!!). I don't think we'll fix this for 1.0 Metrics release, so I plan to call this out in docs. My current guess is.- some form of Bound instruments will solve this problem.

Let me re-check this PR to see if we want to keep this for measuring this scenario. The current behavior of aggressively draining the hash map is working against us, as that will force some updates to need write lock, effectively causing other threads to potentially get blocked...Something #2328 can address..

cijothomas · 2024-12-03T04:16:39Z

stress/src/metrics_latency.rs

+                KeyValue::new("url.scheme", "not_found"),
+                KeyValue::new("error.type", 404),
+                KeyValue::new("http.response.status_code", 404),
+                KeyValue::new("http.route", "testing/metrics/latency"),


one interesting thing is - the route may not be readily available as a `static str always, forcing users to String allocate to report this!
#2089 (comment) is trying to fix that scenario, hopefully we can make it happen for 1.0 stable of Metrics!

fraillt requested a review from a team as a code owner November 22, 2024 10:15

fraillt mentioned this pull request Nov 22, 2024

Metrics collect stress test #2247

Closed

4 tasks

This was referenced Nov 22, 2024

ValueMap - separate HashMap for sorted attribs #2288

Open

DRAFT: metrics sharding handwritten #2304

Open

cijothomas reviewed Nov 23, 2024

View reviewed changes

fraillt force-pushed the measure_metrics_latency branch from e0df038 to aa55bb8 Compare November 23, 2024 13:37

cijothomas reviewed Nov 23, 2024

View reviewed changes

Measure metrics latency

7058272

fraillt force-pushed the measure_metrics_latency branch 2 times, most recently from 3b7fbe8 to 7ae7647 Compare November 28, 2024 13:49

Adaptive collect interval

8898ac5

fraillt force-pushed the measure_metrics_latency branch from 7ae7647 to 8898ac5 Compare November 28, 2024 13:50

cijothomas reviewed Dec 3, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Measure metrics latency #2323

Measure metrics latency #2323

fraillt commented Nov 22, 2024

codecov bot commented Nov 22, 2024 •

edited

Loading

cijothomas Nov 23, 2024

fraillt Nov 23, 2024

cijothomas Nov 23, 2024

cijothomas Nov 23, 2024

fraillt Nov 23, 2024 •

edited

Loading

cijothomas Nov 24, 2024

fraillt Nov 28, 2024

cijothomas Dec 3, 2024

cijothomas Dec 3, 2024

Measure metrics latency #2323

Are you sure you want to change the base?

Measure metrics latency #2323

Conversation

fraillt commented Nov 22, 2024

Merge requirement checklist

codecov bot commented Nov 22, 2024 • edited Loading

Codecov Report

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

fraillt Nov 23, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

codecov bot commented Nov 22, 2024 •

edited

Loading

fraillt Nov 23, 2024 •

edited

Loading