Merge pull request apache#3246 from agresch/agresch_storm_3618
STORM-3618 add meter to track scheduling errors
Ethanlm authored Apr 9, 2020
2 parents 00f48d6 + 15d5872 commit 3eca57d
Showing 2 changed files with 5 additions and 1 deletion.
3 changes: 2 additions & 1 deletion docs/ClusterMetrics.md
@@ -58,6 +58,7 @@ These are metrics that are specific to a nimbus instance. In many instances onl
 |-------------|------|-------------|
 | nimbus:files-upload-duration-ms | timer | Time it takes to upload a file from start to finish (Not Blobs, but this may change) |
 | nimbus:longest-scheduling-time-ms | gauge | Longest time ever taken so far to schedule. This includes the current scheduling run, which is intended to detect if scheduling is stuck for some reason. |
+| nimbus:mkAssignments-Errors | meter | tracks exceptions from mkAssignments |
 | nimbus:num-activate-calls | meter | calls to the activate thrift method. |
 | nimbus:num-added-executors-per-scheduling | histogram | number of executors added after a scheduling run. |
 | nimbus:num-added-slots-per-scheduling | histogram | number of slots added after a scheduling run. |
@@ -102,7 +103,7 @@ These are metrics that are specific to a nimbus instance. In many instances onl
 | nimbus:num-uploadChunk-calls | meter | calls to uploadChunk thrift method. |
 | nimbus:num-uploadNewCredentials-calls | meter | calls to uploadNewCredentials thrift method. |
 | nimbus:process-worker-metric-calls | meter | calls to processWorkerMetrics thrift method. |
-| nimbus:mkAssignments-Errors | meter | tracks exceptions from mkAssignments |
+| nimbus:scheduler-internal-errors | meter | tracks internal scheduling errors |
 | nimbus:topology-scheduling-duration-ms | timer | time it takes to do a scheduling run. |
 | nimbus:total-available-memory-non-negative | gauge | available memory on the cluster MB |
 | nimbuses:uptime-secs | histogram | uptime of nimbuses |
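Both metrics touched in this file are of type `meter`. As a rough sketch of what that type records, the hypothetical `SimpleMeter` below counts events and derives a mean rate per second. It is loosely modeled on Dropwizard Metrics' `Meter` (assuming, as in Storm 2.x, that the scheduler's `Meter` fields are `com.codahale.metrics.Meter`), which additionally reports 1-, 5- and 15-minute moving averages that this sketch omits. The clock is injectable so the rate can be checked deterministically.

```java
import java.util.concurrent.atomic.LongAdder;
import java.util.function.LongSupplier;

/**
 * Hypothetical stand-in for a "meter" metric: counts events and reports their
 * mean rate since creation. Loosely modeled on Dropwizard's Meter; this is NOT
 * Storm's or Dropwizard's implementation.
 */
class SimpleMeter {
    private final LongAdder count = new LongAdder();
    private final LongSupplier nanoClock; // injectable so the rate is testable
    private final long startNanos;

    SimpleMeter(LongSupplier nanoClock) {
        this.nanoClock = nanoClock;
        this.startNanos = nanoClock.getAsLong();
    }

    SimpleMeter() {
        this(System::nanoTime);
    }

    /** Records one event, e.g. one internal scheduling error. */
    void mark() {
        count.increment();
    }

    long getCount() {
        return count.sum();
    }

    /** Events per second averaged over the meter's lifetime; 0 if no time has passed. */
    double getMeanRate() {
        long elapsed = nanoClock.getAsLong() - startNanos;
        if (elapsed <= 0) {
            return 0.0;
        }
        return getCount() / (elapsed / 1_000_000_000.0);
    }
}
```

Because `mark()` only increments a `LongAdder`, it is cheap enough to call on every error path, including from concurrent threads.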
@@ -58,6 +58,7 @@ public class ResourceAwareScheduler implements IScheduler {
 private int schedulingTimeoutSeconds;
 private ExecutorService backgroundScheduling;
 private Meter schedulingTimeoutMeter;
+private Meter internalErrorMeter;
 
 private static void markFailedTopology(User u, Cluster c, TopologyDetails td, String message) {
 markFailedTopology(u, c, td, message, null);
@@ -78,6 +79,7 @@ private static void markFailedTopology(User u, Cluster c, TopologyDetails td, St
 public void prepare(Map<String, Object> conf, StormMetricsRegistry metricsRegistry) {
 this.conf = conf;
 schedulingTimeoutMeter = metricsRegistry.registerMeter("nimbus:num-scheduling-timeouts");
+internalErrorMeter = metricsRegistry.registerMeter("nimbus:scheduler-internal-errors");
 schedulingPriorityStrategy = ReflectionUtils.newInstance(
 (String) conf.get(DaemonConfig.RESOURCE_AWARE_SCHEDULER_PRIORITY_STRATEGY));
 configLoader = ConfigLoaderFactoryService.createConfigLoader(conf);
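The `prepare` hunk looks up each meter once by name and stores it in a field, so the scheduling path only has to call `mark()`. The sketch below mimics that shape with hypothetical classes (`CountingMeter`, `SketchMetricsRegistry`, `SketchScheduler`). The get-or-create behavior of `registerMeter` here is an assumption modeled on Dropwizard's `MetricRegistry.meter(String)`; the diff itself does not show how Storm's `StormMetricsRegistry` handles a repeated name.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

/** Minimal counting meter (hypothetical; rate reporting omitted for brevity). */
class CountingMeter {
    private final LongAdder count = new LongAdder();

    void mark() { count.increment(); }

    long getCount() { return count.sum(); }
}

/**
 * Hypothetical registry: registerMeter returns the existing meter when the name
 * is already registered, so every caller reports into one shared series.
 */
class SketchMetricsRegistry {
    private final Map<String, CountingMeter> meters = new ConcurrentHashMap<>();

    CountingMeter registerMeter(String name) {
        return meters.computeIfAbsent(name, n -> new CountingMeter());
    }
}

/** Mirrors the shape of prepare(): resolve meters once and keep them in fields. */
class SketchScheduler {
    private CountingMeter schedulingTimeoutMeter;
    private CountingMeter internalErrorMeter;

    void prepare(SketchMetricsRegistry registry) {
        // Metric names taken from the diff; the registry itself is hypothetical.
        schedulingTimeoutMeter = registry.registerMeter("nimbus:num-scheduling-timeouts");
        internalErrorMeter = registry.registerMeter("nimbus:scheduler-internal-errors");
    }

    CountingMeter schedulingTimeoutMeter() { return schedulingTimeoutMeter; }

    CountingMeter internalErrorMeter() { return internalErrorMeter; }
}
```

Resolving the meter in `prepare` keeps the name lookup off the error path and ties the Java string to the metric name documented in `docs/ClusterMetrics.md`.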
@@ -235,6 +237,7 @@ private void scheduleTopology(TopologyDetails td, Cluster cluster, final User to
 }
 }
 } catch (Exception ex) {
+internalErrorMeter.mark();
 markFailedTopology(topologySubmitter, cluster, td,
 "Internal Error - Exception thrown when scheduling. Please check logs for details", ex);
 return;
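The third hunk marks the meter inside the existing catch-all, before the topology is flagged as failed and the method returns. A self-contained sketch of that guard follows, assuming a pluggable strategy does the real scheduling work. `GuardedScheduler`, `Strategy` and `ErrorMeter` are hypothetical names, and failures are recorded in a list rather than through Storm's `markFailedTopology`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.LongAdder;

/** Minimal event counter standing in for a metrics Meter (hypothetical). */
class ErrorMeter {
    private final LongAdder count = new LongAdder();

    void mark() { count.increment(); }

    long getCount() { return count.sum(); }
}

/**
 * Hypothetical sketch of the catch-all pattern in scheduleTopology: an
 * unexpected exception is counted and the topology is recorded as failed
 * instead of letting the exception escape the scheduling loop.
 */
class GuardedScheduler {
    interface Strategy {
        void schedule(String topologyId) throws Exception;
    }

    private final ErrorMeter internalErrorMeter = new ErrorMeter();
    private final List<String> failures = new ArrayList<>();
    private final List<String> scheduled = new ArrayList<>();

    void scheduleTopology(String topologyId, Strategy strategy) {
        try {
            strategy.schedule(topologyId);
        } catch (Exception ex) {
            // Count first, so the error is recorded even if the bookkeeping below throws.
            internalErrorMeter.mark();
            failures.add(topologyId + ": Internal Error - Exception thrown when scheduling. "
                    + "Please check logs for details (" + ex.getMessage() + ")");
            return; // keep a failed topology out of the success path
        }
        scheduled.add(topologyId);
    }

    long internalErrorCount() { return internalErrorMeter.getCount(); }

    List<String> failures() { return failures; }

    List<String> scheduled() { return scheduled; }
}
```

A rising `nimbus:scheduler-internal-errors` rate then signals scheduler bugs directly, without having to scan logs for the "Internal Error" message.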