Inconsistent LDMS Metrics Set #1273
Unanswered · VanshIntel asked this question in Q&A · Replies: 2 comments, 2 replies
-
Hi @VanshIntel, check the value of base->set. If it is NULL, base_sample_begin() and base_sample_end() do nothing. Otherwise, your code looks correct to me.
-
@VanshIntel I think I didn't make my point clear. The code looks correct, so this could be a run-time issue that you can only check with gdb. Put a breakpoint in base_sample_begin() and see whether base->set is NULL. If it is, that explains why the set is inconsistent, and the next step is to find out how (or whether) base->set is being set to NULL after config().
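To illustrate the point about a NULL base->set, here is a minimal, self-contained C sketch. It does not use the real LDMS API: mock_set, mock_base, mock_sample_begin(), mock_sample_end(), and mock_sample() are hypothetical stand-ins, and the "consistent" flag is a simplified assumption about how a set gets marked around a sample. It only models the behavior described above, where begin/end silently do nothing when base->set is NULL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for an LDMS metric set; NOT the real ldms_set_t. */
struct mock_set {
    bool in_transaction; /* true between begin and end */
    bool consistent;     /* what a reader such as ldms_ls would report */
    double gpu_util;     /* one example metric value */
};

/* Simplified stand-in for the sampler base object; NOT the real base_data_t. */
struct mock_base {
    struct mock_set *set; /* NULL if no set was ever attached to the base */
};

/* Assumed behavior: with a NULL set this is a no-op, as described in the reply. */
static void mock_sample_begin(struct mock_base *base)
{
    if (!base->set)
        return;
    base->set->in_transaction = true;
    base->set->consistent = false;
}

/* Assumed behavior: closing the transaction marks the set consistent. */
static void mock_sample_end(struct mock_base *base)
{
    if (!base->set)
        return;
    base->set->in_transaction = false;
    base->set->consistent = true;
}

/*
 * One sampling pass: the value is written into `target`, while the
 * begin/end markers act only on base->set. Returns the consistency
 * flag a reader of `target` would see afterwards.
 */
bool mock_sample(struct mock_base *base, struct mock_set *target, double value)
{
    assert(base != NULL && target != NULL);
    mock_sample_begin(base);
    target->gpu_util = value;
    mock_sample_end(base);
    return target->consistent;
}
```

In this model, calling mock_sample() with a base whose set is NULL still writes the metric value but leaves the target flagged inconsistent, which resembles the symptom in the ldms_ls output: plausible values on an "inconsistent" set. With the set attached to the base, the same call returns true.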
-
Hello,
My team is developing and testing an LDMS sampler, the gpumetrics sampler (https://github.com/ovis-hpc/ovis/tree/OVIS-4/ldms/src/contrib/sampler/gpu_metrics_sampler), on a compute node and an aggregator node. We are trying to store the aggregated data in a CSV file on the aggregator node and publish it to a Kafka bus.
I configured sampler.conf to collect meminfo and gpumetrics on the compute node:
load name=meminfo
config name=meminfo producer=node001 instance=node001/meminfo
start name=meminfo interval=5000000
load name=gpumetrics
config name=gpumetrics producer=node001 instance=node001/gpumetrics
start name=gpumetrics interval=5000000
I started the ldmsd daemon and ran "ldms_ls -h localhost -x sock -p 10001 -l" on the compute node. It shows the following result:
node001/meminfo: consistent
M u64 component_id 0
D u64 job_id 0
D u64 app_id 0
D u64 MemTotal 1056160644
D u64 MemFree 1030307584
D u64 MemAvailable 1027280492
node001/gpumetrics: inconsistent
D d64 gpu00_gpu_util 0.000000
D d64 gpu00_mem_util 0.043257
D u64 gpu00_mem_vram_used 59437056
The data values look good. Comparing the two samplers, however, we noticed that their consistency status differs: the meminfo set is reported as 'consistent' while the gpumetrics set is reported as 'inconsistent'.
I configured aggregator.conf to aggregate the LDMS data on the aggregator node, save the data to a CSV file, and push the data to the Kafka bus:
prdcr_add name=prdcr2 host=node001 port=10001 xprt=sock type=active interval=5000000
prdcr_start_regex regex=.*
updtr_add name=updtr1 interval=5000000 offset=200000
updtr_prdcr_add name=updtr1 regex=.*
updtr_start name=updtr1
load name=store_csv
config name=store_csv path=/tmp/LDMS_output
strgp_add name=gpumetrics_csv plugin=store_csv container=gpumetrics_store schema=gpumetrics
strgp_start name=gpumetrics_csv
strgp_add name=policy_meminfo_csv plugin=store_csv schema=meminfo container=memory_metrics
strgp_start name=policy_meminfo_csv
load name=cray_store_kafka
config name=cray_store_kafka container=kafka_plugin brokers=admin:9092 topic=ldms-monitoring action=init
strgp_add name=policy_gpumetrics_kfk plugin=cray_store_kafka container=kafka_plugin schema=gpumetrics
strgp_prdcr_add name=policy_gpumetrics_kfk regex=.*
strgp_start name=policy_gpumetrics_kfk
strgp_add name=policy_meminfo_kfk plugin=cray_store_kafka container=kafka_plugin schema=meminfo
strgp_prdcr_add name=policy_meminfo_kfk regex=.*
strgp_start name=policy_meminfo_kfk
Running "ldms_ls -h localhost -x sock -p 20001 -l" on the aggregator node shows data from both the meminfo and gpumetrics samplers. The consistency status is the same as on the compute node: meminfo is consistent while gpumetrics is inconsistent. In addition, the CSV file generated for meminfo contains valid data, whereas the one generated for gpumetrics is blank. When we read from the Kafka bus, only meminfo data is streamed; there is no data for the gpumetrics sampler.
The aggregator logs are as follows:
Fri Aug 25 14:33:10 2023: INFO : Set node001/meminfo oversampled 411 == 411.
Fri Aug 25 14:33:10 2023: DEBUG : Pushing set 0x7efe10004dd0 node001/meminfo
Fri Aug 25 14:33:15 2023: DEBUG : updtr_task sched '5000000': set 'node001/gpumetrics'
Fri Aug 25 14:33:15 2023: DEBUG : Schedule an update for set node001/gpumetrics
Fri Aug 25 14:33:15 2023: DEBUG : updtr_task sched '5000000': set 'node001/meminfo'
Fri Aug 25 14:33:15 2023: DEBUG : Schedule an update for set node001/meminfo
Fri Aug 25 14:33:15 2023: DEBUG : Update complete for Set node001/gpumetrics with status 0
Fri Aug 25 14:33:15 2023: INFO : Set node001/gpumetrics is inconsistent.
Fri Aug 25 14:33:15 2023: DEBUG : Pushing set 0x7efe10002c50 node001/gpumetrics
Fri Aug 25 14:33:15 2023: DEBUG : Update complete for Set node001/meminfo with status 0
I am starting to wonder whether the "inconsistent" status of the gpumetrics set is what prevents its data from being stored in the CSV file and published to Kafka. Further research suggests that a metric set will be inconsistent if base_sample_begin() and base_sample_end() are not called. However, I can confirm that both calls are present in the sampler code below.
https://github.com/ovis-hpc/ovis/blob/OVIS-4/ldms/src/contrib/sampler/gpu_metrics_sampler/gpu_metrics_ldms_sampler.c
base_sample_begin(base);
populateMetricSet(phDevices, numDevicesToSample, set, metric_offset);
base_sample_end(base);
I would really appreciate your input on how to make the gpumetrics metric set consistent. Thanks in advance!