Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feature(ArtifactsTest): IOTune parameters validation #10121

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

k0machi
Copy link
Contributor

@k0machi k0machi commented Feb 18, 2025

This commits adds a new subtest inside Artifacts test, that does the
check of machine image io_properties.yaml by comparing them to the
actual machine values and showing the deviation. The results are both
printed to console with deviation value and submitted to argus as a
GenericResultTable.

Fixes scylladb/qa-tasks#1787

Testing

PR pre-checks (self review)

  • I added the relevant backport labels
  • I didn't leave commented-out/debugging code

Reminders

  • Add New configuration option and document them (in sdcm/sct_config.py)
  • Add unit tests to cover my changes (under unit-test/ folder)
  • Update the Readme/doc folder relevant to this change (if needed)

@k0machi k0machi self-assigned this Feb 18, 2025
@k0machi k0machi requested review from fruch and soyacz February 18, 2025 16:46
@k0machi
Copy link
Contributor Author

k0machi commented Feb 18, 2025

Demo:
image

17:42:50  < t:2025-02-18 16:42:49,047 f:iotune.py       l:55   c:sdcm.utils.validators.iotune p:INFO  > Disk performance values validation - testing /var/lib/scylla
17:42:50  < t:2025-02-18 16:42:49,047 f:iotune.py       l:59   c:sdcm.utils.validators.iotune p:INFO  > [/var/lib/scylla] read_iops: 109519 (-0)
17:42:50  < t:2025-02-18 16:42:49,047 f:iotune.py       l:59   c:sdcm.utils.validators.iotune p:INFO  > [/var/lib/scylla] read_bandwidth: 806913408 (6)
17:42:50  < t:2025-02-18 16:42:49,047 f:iotune.py       l:59   c:sdcm.utils.validators.iotune p:INFO  > [/var/lib/scylla] write_iops: 59110 (-3)
17:42:50  < t:2025-02-18 16:42:49,047 f:iotune.py       l:59   c:sdcm.utils.validators.iotune p:INFO  > [/var/lib/scylla] write_bandwidth: 559601600 (-0)
17:42:50  < t:2025-02-18 16:42:50,119 f:artifacts_test.py l:280  c:ArtifactsTest        p:INFO  > Verify XFS mount options for /var/lib/scylla contain `discard'

@@ -325,6 +326,14 @@ def test_scylla_service(self):
with self.subTest("check ENA support"):
assert self.node.ena_support, "ENA support is not enabled"

if ("gce" in backend or "aws" in backend or "azure" in backend):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it going to work with e.g. k8s-local-kind-aws?
I'd propose to use if backend in ("gce", "aws", "azure")
Also, shouldn't it be used only with predefined images (if params.get('use_preinstalled_scylla')?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, I was testing and removed use_preinstalled_scylla, will put it back

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

diff = (val / preset_val - 1) * 100
LOGGER.info("[%s] %s: %s (%.0f)", mountpoint, key, val, diff)

def _submit_results_to_argus(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be done outside of IOTuneValidator which should do only one thing: validating correctness of predefined io_properties against actual values: all things like submitting or printing in console could be done separately to not blur purpose of this class.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay, I'll move submission to the tester body.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved to argus_results, kept the console output in the validator

Comment on lines 85 to 115
ValidationRules = {
"read_iops": ValidationRule(best_pct=10, fixed_limit=bottom_limit(preset_disk.get("read_iops"))),
"read_bandwidth": ValidationRule(best_pct=10, fixed_limit=bottom_limit(preset_disk.get("read_iops"))),
"write_iops": ValidationRule(best_pct=10, fixed_limit=bottom_limit(preset_disk.get("read_iops"))),
"write_bandwidth": ValidationRule(best_pct=10, fixed_limit=bottom_limit(preset_disk.get("read_iops"))),
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thing best_pct rule is not right.
I'd propose different approach: instead of sending raw metrics, send differences in absolute value. This way higher_is_better could be set to False and fixed limit (calculated by predefined io_properties.yaml metric * x%). This should work for any instance size.
In that case table name should be renamed to f"{self.params.get('cluster_backend')} - {self.node.db_node_instance_type} IO Tune absolute deviation"
Drawback is only when we need to know exact value we would need to look into logs (could be tackled by publishing error event with full details in that case)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can send both I think, and that's what I was thinking as well - I'll look into doing it that way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, I think. We might want to sent the baseline as well.

Comment on lines 94 to 122
if tested_mountpoint != preset_disk["mountpoint"]:
LOGGER.warning("Disks differ - probably a mistake: %s vs %s, will not submit iotune results",
tested_mountpoint, preset_disk["mountpoint"])
return
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we shouldn't fail silently here. Better raise error event

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@k0machi k0machi force-pushed the artifacts-iotune-params-verification branch 4 times, most recently from 4e35601 to 9d35c02 Compare February 25, 2025 03:22
@k0machi k0machi requested a review from soyacz February 25, 2025 03:22
@k0machi k0machi force-pushed the artifacts-iotune-params-verification branch 4 times, most recently from c9cd843 to 677fede Compare February 25, 2025 03:39
This commits adds a new subtest inside Artifacts test, that does the
check of machine image io_properties.yaml by comparing them to the
actual machine values and showing the deviation

Fixes scylladb/qa-tasks#1787
@k0machi k0machi force-pushed the artifacts-iotune-params-verification branch from 677fede to 079bbc7 Compare February 25, 2025 03:39
Copy link
Contributor

@soyacz soyacz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

send_iotune_results_to_argus function does validation and other unrelated to sending results to argus operations.
These should be part of validator and validator should prepare data for this function (this one should just send with minimum logic).

Comment on lines +297 to +302
if not preset_disk:
LOGGER.warning("Unable to continue - node should have io_properties.yaml, but it doesn't.")
TestFrameworkEvent(source="send_iotune_results_to_argus",
message="Unable to continue - node should have io_properties.yaml, but it doesn't.",
severity=Severity.ERROR).publish()
return
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this don't belong here. Possibly could be moved to IOTuneValidator._read_io_properties

ColumnMetadata(name="write_bandwidth", unit="bps", type=ResultType.INTEGER, higher_is_better=True),
]

ValidationRules = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

redundant

class IOPropertiesResultsTable(GenericResultTable):
class Meta:
name = f"{params.get('cluster_backend')} - {node.db_node_instance_type} Disk Performance"
description = "io_properties.yaml comparison with live data"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's no longer comparison rather pure data from io_properties (row name should define 'image preset' or 'measured')

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants