Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimise Prometheus targets handling #2474

Merged
merged 5 commits into from
Feb 13, 2025
Merged

Conversation

thampiotr
Copy link
Contributor

@thampiotr thampiotr commented Jan 22, 2025

PR Description

High level overview: This changes Targets to a capsule instead of a Go map in a way that is transparent to the users, but allows us to optimise the allocation and memory footprint of the targets handling in Prometheus pipelines.

NOTE: this PR appears large, however:

  • 1k is a copy of tests from upstream Prometheus for extra validation
  • 1.5k is new tests to ensure Target/TargetBuilder/equality work as expected
  • A lot of lines are changing the use of Targets throughout the codebase, in particular, various ways they're constructed in tests.

Which issue(s) this PR fixes

Fixes #1999

Notes to the Reviewer

Changes in more detail to facilitate the review:

  • Implemented a new Target which uses the underlying Prometheus LabelSets we get from discovery: the shared labels in group and target's own labels in own fields. This allows to avoid a lot of allocation.
  • Because two targets can have the same labels but they are split differently between their own and group fields, we need to use a dedicated Equals function. Implementing it is easy, but we also need to make sure that it is used by Alloy runtime (see below).
  • To allow customising Equals function in the runtime, I added the equality package which will attempt to compare using our CustomEquality interface where possible, but if that fails, it will use the previous reflect.DeepEquals. This package is now used throughout the runtime to avoid unnecessary updates etc.
  • Targets implement the necessary ConvertFrom and ConvertInto methods so the users still see them as maps when working with the UI and Alloy config files. The change should be fully transparent to the users. There are tests to verify this.
  • To implement ConvertFrom and ConvertInto we had to expose a couple of internal functions from the syntax package, but that's minimal and I think safe.
  • However, when encoding the capsules into Alloy syntax or into JSON (which is used in UI), the syntax package wasn't checking if given capsule can be converted in to an Alloy object and printed as such. This functionality was added - if we see a capsule that implements ConvertibleIntoCapsule - we know it can be converted into an object (which is a map), so we do that and we print it as such.
  • Targets are now immutable and changing them is done using TargetBuilder. This in turn uses similar method to Prometheus Labels to reduce allocation when adding or deleting labels.
  • Converting targets to the format required by Prometheus scrape is now faster, because we already use the same underlying structures. We further improve performance by reconstructing target groups where shared labels from group are passed on a group level. This reconstruction of target groups is done in discovery.ComponentTargetsToPromTargetGroups. We group them using the group labelset hash, but we also handle hash conflicts in case they are detected.
  • We have already some copy of Prometheus relabel code. I added the actual relabelling code to this so we can use our own, optimised label builder implementation. Also ported the tests to ensure we are 100% compliant.
  • Changed the distributed targets hashing to directly calculate hash instead of allocating labels first and then grabbing their hash.
  • Updated all the tests to use the new Targets and also the correct equality checks on them.

Performance:

Deployed to a large cluster between 10:30 and 13:00, we observed the following impact:

Memory and CPU down, we can also see smaller variation in memory usage which is good for HPA.
image

Allocation profile before:
image

Allocation profile after:
image

Looking at the diff, the reduction is, as expected, mostly in these two functions:
image

image

We can see the overall memory usage went down and back up as the changed was rolled out and then rolled back:
image

PR Checklist

  • CHANGELOG.md updated
  • Documentation added
  • Tests updated
  • Config converters updated

Sorry, something went wrong.

@thampiotr thampiotr force-pushed the thampiotr/targets-refactor-part1 branch from a575d60 to 82ce9f8 Compare January 22, 2025 16:14
@thampiotr thampiotr force-pushed the thampiotr/targets-refactor-part1 branch 2 times, most recently from dddfa4e to b39c1f9 Compare January 30, 2025 17:17
@thampiotr thampiotr changed the title WIP WIP: Optimise Prometheus targets handling Feb 3, 2025
@thampiotr thampiotr force-pushed the thampiotr/targets-refactor-part1 branch 3 times, most recently from b912766 to 2a0575b Compare February 4, 2025 15:19
internal/component/common/relabel/relabel.go Show resolved Hide resolved
package relabel

// LabelBuilder is an interface that can be used to change labels with relabel logic.
type LabelBuilder interface {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This allows us to use our own label builder with prometheus relabel logic. We may be able to contribute this upstream as it doesn't seem harmful.

internal/component/common/relabel/relabel.go Show resolved Hide resolved
internal/component/common/relabel/relabel_test.go Outdated Show resolved Hide resolved
@@ -214,6 +216,7 @@ func attemptLoadingAlloyConfig(t *testing.T, bb []byte) {
clusterService,
labelstore.New(nil, prometheus.DefaultRegisterer),
remotecfgService,
livedebugging.New(),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed some errors in log related to unknown service, so I added this. I think this test is attempting to start the config and find any config syntax related issues before the run inevitably fails.

comparedAndEqual = result{compared: true, isEqual: true}
)

func deepCustomEqual(v1, v2 reflect.Value) result {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adapted from reflect.DeepEqual.

# Conflicts:
#	CHANGELOG.md
@thampiotr thampiotr force-pushed the thampiotr/targets-refactor-part1 branch from 2788b05 to 872d35b Compare February 7, 2025 11:28
@thampiotr thampiotr changed the title WIP: Optimise Prometheus targets handling Optimise Prometheus targets handling Feb 7, 2025
Benchmark_Targets_TypicalPipeline-8 34 33537419 ns/op 6022333 B/op 100543 allocs/op
Benchmark_Targets_TypicalPipeline-8 34 33687083 ns/op 6109172 B/op 100543 allocs/op
*/
func Benchmark_Targets_TypicalPipeline(b *testing.B) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Compared to current main we see:

benchstat ./01_typical_pipeline_targets_as_maps.txt  ./08_typical_pipeline_obj_pools.txt
goos: darwin
goarch: arm64
pkg: github.com/grafana/alloy/internal/component/discovery
cpu: Apple M2
                           │ ./01_typical_pipeline_targets_as_maps.txt │  ./08_typical_pipeline_obj_pools.txt  │
                           │                  sec/op                   │   sec/op     vs base                  │
_Targets_TypicalPipeline-8                                 38.63m ± 1%   33.61m ± 6%  -12.98% (p=0.000 n=10+6)

                           │ ./01_typical_pipeline_targets_as_maps.txt │  ./08_typical_pipeline_obj_pools.txt   │
                           │                   B/op                    │     B/op      vs base                  │
_Targets_TypicalPipeline-8                               61.953Mi ± 0%   5.743Mi ± 1%  -90.73% (p=0.000 n=10+6)

                           │ ./01_typical_pipeline_targets_as_maps.txt │  ./08_typical_pipeline_obj_pools.txt  │
                           │                 allocs/op                 │  allocs/op   vs base                  │
_Targets_TypicalPipeline-8                                 241.8k ± 0%   100.5k ± 0%  -58.42% (p=0.000 n=10+6)

@thampiotr thampiotr marked this pull request as ready for review February 7, 2025 11:55
@thampiotr thampiotr requested a review from mattdurham February 7, 2025 11:55
relabelled, keep := relabel.Process(lset, relabelConfigs...)
var (
relabelled discovery.Target
builder = discovery.NewTargetBuilderFrom(t)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This builder should/could be reused to avoid extra allocations

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure it should be since the content of the builder is mutated on each call to ProcessBuilder, if we can make it safe then seems reasonable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the current profiles show good performance gains here already, I would like to leave this optimisation for future PRs after we have more data. I would like to consider an option of using an object pool for the couple of maps that the builder uses underneath as an alternative and compare the two options.

mustGet := func(label string) string {
val, ok := t.Get(label)
if !ok { // should never happen as Target is immutable
panic("label concurrently modified - this is a bug - please report an issue")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not panic but instead return an error

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Target labels are immutable and the only way that labels passed in order are no longer there is someone introducing a bug. I've removed the panic to not have crashes if this ever happened, but instead of returning an error, in order not to overcomplicate the API, I'm hashing an empty string - it will still provide consistent hashing behaviour and is a better choice than complicating things with error handling IMO.

Copy link
Collaborator

@mattdurham mattdurham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some questions but overall very solid! Fantastic gains.

Copy link
Contributor Author

@thampiotr thampiotr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed the comments, PTAL

internal/component/common/relabel/relabel.go Show resolved Hide resolved
internal/component/common/relabel/relabel.go Show resolved Hide resolved
internal/runtime/equality/equality.go Show resolved Hide resolved
internal/component/discovery/target.go Outdated Show resolved Hide resolved
mustGet := func(label string) string {
val, ok := t.Get(label)
if !ok { // should never happen as Target is immutable
panic("label concurrently modified - this is a bug - please report an issue")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Target labels are immutable and the only way that labels passed in order are no longer there is someone introducing a bug. I've removed the panic to not have crashes if this ever happened, but instead of returning an error, in order not to overcomplicate the API, I'm hashing an empty string - it will still provide consistent hashing behaviour and is a better choice than complicating things with error handling IMO.

internal/component/discovery/target.go Show resolved Hide resolved
internal/component/discovery/target.go Show resolved Hide resolved
relabelled, keep := relabel.Process(lset, relabelConfigs...)
var (
relabelled discovery.Target
builder = discovery.NewTargetBuilderFrom(t)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the current profiles show good performance gains here already, I would like to leave this optimisation for future PRs after we have more data. I would like to consider an option of using an object pool for the couple of maps that the builder uses underneath as an alternative and compare the two options.

Copy link
Collaborator

@mattdurham mattdurham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, great work. My only remaining ask is to document the places we cant change the code due to backwards compatibility with prometheus.

@thampiotr thampiotr enabled auto-merge (squash) February 13, 2025 16:41
@thampiotr thampiotr merged commit 9f783e4 into main Feb 13, 2025
30 checks passed
@thampiotr thampiotr deleted the thampiotr/targets-refactor-part1 branch February 13, 2025 16:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Reduce resource consumption of discovery components
5 participants