[pull] master from flux-framework:master #38

Open: wants to merge 9,570 commits into base: master

Changes from 1 commit
481f3ec
testsuite: expand default frobnicator testing
grondo Oct 31, 2024
0fe32a4
Merge pull request #6403 from grondo/issue#6400
mergify[bot] Nov 1, 2024
d9f61f4
t: fix inconsistent tabbing
chu11 Oct 29, 2024
f95b205
flux-exec: use zlistx_t over zlist_t
chu11 Oct 28, 2024
2b403a9
flux-exec: use stdin flow control
chu11 Oct 8, 2024
fe8768b
flux-exec: disable stdin flow for sdexec
garlick Oct 28, 2024
4aea6bb
t: cover flux-exec stdin flow control
chu11 Oct 29, 2024
718a9ba
Merge pull request #6370 from chu11/issue4572_flux_exec
mergify[bot] Nov 2, 2024
4e8ae34
perilog: ensure job doesn't start when prolog cancel times out
grondo Nov 3, 2024
10409f7
perilog: fix error message on prolog cancel timeout
grondo Nov 3, 2024
19e4ae1
testsuite: test prolog timeout after cancel
grondo Nov 3, 2024
dbb0800
Merge pull request #6412 from grondo/prolog-cancel-timeout-fix
mergify[bot] Nov 3, 2024
ebe9118
libsubprocess: close extra file descriptors
garlick Nov 5, 2024
f8a1e8f
testsuite: check for extra subprocess fds
garlick Nov 5, 2024
9411280
Merge pull request #6416 from garlick/issue#6415
mergify[bot] Nov 5, 2024
649e65f
build: require flux-security >= 0.13.0
garlick Nov 4, 2024
ec7098c
ci: specify flux-security-0.13.0
garlick Nov 4, 2024
ab6a1ac
job-exec: don't use flux imp kill
garlick Nov 1, 2024
a891285
job-exec: send SIGUSR1 to the IMP, not SIGKILL
garlick Nov 1, 2024
9fdfebe
job-manager: don't use flux imp kill
garlick Nov 1, 2024
cef76a7
flux-exec: don't use flux imp kill
garlick Nov 4, 2024
f6cf51d
libsubprocess: drop bulk_exec_set_imp_path()
garlick Nov 3, 2024
0afe5a3
Merge pull request #6408 from garlick/issue#6406
mergify[bot] Nov 6, 2024
58227a1
broker: fix state machine create error path
garlick Oct 29, 2024
b8819e8
broker: make timeout config function reusable
garlick Oct 29, 2024
f5fc0e6
broker: add optional cleanup timeout on signal
garlick Oct 29, 2024
846e6da
systemd: set broker.cleanup-timeout=45
garlick Oct 29, 2024
86a31be
testsuite: cover broker.cleanup-timeout
garlick Oct 29, 2024
97bf2b9
flux-broker-attributes(7): add cleanup-timeout
garlick Oct 29, 2024
ff63621
Merge pull request #6397 from garlick/issue#6388
mergify[bot] Nov 6, 2024
37154e1
docs: rfc flux-config-bootstrap diagram
vsoch Nov 3, 2024
3ede2d8
Merge pull request #6411 from researchapps/add-topology-diagram
mergify[bot] Nov 6, 2024
8ee04d5
NEWS.md: add release notes for v0.68.0
garlick Nov 6, 2024
0e7cecb
Merge pull request #6417 from garlick/news68
mergify[bot] Nov 6, 2024
722d786
job-manager/history: optimize list insertion
chu11 Nov 8, 2024
3ad7986
Merge pull request #6422 from chu11/issue6419_job_manager_history_insert
mergify[bot] Nov 8, 2024
a802b36
libutil: add timestamp_tzoffset()
grondo Nov 8, 2024
ad0fa95
libutil: add invalid argument tests for timestamp_tzoffset()
grondo Nov 8, 2024
035046a
libeventlog: formatter: output ISO timestamps in localtime+offset
grondo Nov 8, 2024
066ec11
flux-dmesg: convert ISO timestamp to localtime+offset in output
grondo Nov 8, 2024
160fea4
flux-dmesg: pass address of stdlog_header to print functions
grondo Nov 8, 2024
1c0757c
doc: fix missing colon in flux-dmesg(1)
grondo Nov 8, 2024
9cd1468
testsuite: test flux-dmesg timestamp timezone rendering
grondo Nov 8, 2024
bf7fb79
testsuite: test `flux job eventlog` timezone formatting
grondo Nov 8, 2024
e1e59d7
Merge pull request #6423 from grondo/issue#6421
mergify[bot] Nov 10, 2024
ccd6850
libmissing: add json_object_update_recursive()
grondo Nov 10, 2024
3773531
libflux: update config tables recursively in flux_conf_parse(3)
grondo Nov 10, 2024
cd6002f
libflux/test: test that flux conf tables update recursively
grondo Nov 10, 2024
1cd76a3
Merge pull request #6424 from grondo/issue#6396
mergify[bot] Nov 11, 2024
bd03861
doc: launch jobs with systemd in the admin guide
garlick Nov 12, 2024
80527ea
Merge pull request #6427 from garlick/admin_guide_systemd
mergify[bot] Nov 12, 2024
81928cb
doc: drop EXPERIMENTAL tag from housekeeping
garlick Nov 11, 2024
4adf007
doc: add TROUBLESHOOTING to flux-housekeeping(1)
garlick Nov 11, 2024
82cf509
doc: expand perilog admin guide section
garlick Nov 11, 2024
6689625
Merge pull request #6425 from garlick/doc_updates
mergify[bot] Nov 12, 2024
edbb9c1
python: indicate held jobs in JobInfo.contextual_info field
grondo Nov 12, 2024
ca1be94
doc: update description of `contextual_info` in flux-jobs(1)
grondo Nov 12, 2024
c61dfca
testsuite: test held jobs and INFO field
grondo Nov 12, 2024
83d7b57
Merge pull request #6430 from grondo/issue#6426
mergify[bot] Nov 13, 2024
c4df8a8
perilog: raise prolog kill timeout to 1m
garlick Nov 13, 2024
36998a1
flux-config-job-manager(5): new dflt kill-timeout
garlick Nov 13, 2024
2af9ff3
Merge pull request #6431 from garlick/issue#6420
mergify[bot] Nov 13, 2024
05a6d01
python: call shutdown() on executor in job validator
grondo Nov 13, 2024
6f66ccc
Merge pull request #6435 from grondo/issue#6434
mergify[bot] Nov 13, 2024
1bd1fa4
libflux: fix whitespace issues
garlick Nov 13, 2024
46f41dd
libflux: clean up includes
garlick Nov 13, 2024
f800897
libflux: move reactor_get/set to handle class
garlick Nov 13, 2024
1efba39
libflux: move reactor ops out of public API
garlick Nov 13, 2024
a10c0c0
libflux: export reactor incref/decref
garlick Nov 13, 2024
5b8fe20
libflux: drop flux_ from private reactor functions
garlick Nov 13, 2024
6761eb0
libflux: drop extern "C" {} from private header
garlick Nov 13, 2024
976dd74
libflux: add flux_watcher_is_active(3)
grondo May 31, 2024
e8d88a2
testsuite: cover flux_watcher_is_active(3)
grondo May 31, 2024
bcdc198
doc: document flux_watcher_is_active(3)
grondo May 31, 2024
1de307c
Merge pull request #6436 from garlick/is_active
mergify[bot] Nov 14, 2024
5cf23d6
job-ingest: show trashed subprocesses in stats
garlick Nov 14, 2024
266f510
job-ingest: clean up trashed workers
garlick Nov 14, 2024
95da25e
testsuite: cover stuck validator at module unload
garlick Nov 14, 2024
ae1f516
Merge pull request #6438 from garlick/issue#6432
mergify[bot] Nov 14, 2024
9b5e587
cmd: remove flux-perilog-run and associated tests
grondo Nov 18, 2024
a3f94b6
testsuite: add missing perilog tests
grondo Nov 18, 2024
8425e3c
Merge pull request #6447 from grondo/issue#6428
mergify[bot] Nov 18, 2024
a0ffd87
job-manager: rename queue context struct
garlick Nov 18, 2024
ace9969
job-manager: rename queue class
garlick Nov 18, 2024
8557ff0
job-manager: split simple dual-purpose functions
garlick Nov 18, 2024
cfc56b6
job-manager: clean up queue_ctx_save()
garlick Nov 18, 2024
20cef20
job-manager: rename queue structure members
garlick Nov 18, 2024
3a779fe
job-manager: add descriptive comment for queues
garlick Nov 18, 2024
7d308c9
Merge pull request #6448 from garlick/queue_cleanup
mergify[bot] Nov 18, 2024
20e0d1f
librlist: localize hwloc dependency in one file
garlick Nov 19, 2024
e1f2183
build: eliminate extraneous hwloc linkage
garlick Nov 19, 2024
03b0860
Merge pull request #6450 from garlick/rlist_hwloc
mergify[bot] Nov 19, 2024
854bab5
deb: require flux-security >= 0.13
garlick Nov 20, 2024
f32be6f
build: increase verbosity of flux-security check
garlick Nov 20, 2024
94ded4e
libschedutil: don't export private functions
garlick Nov 20, 2024
e2ef9fe
Merge pull request #6451 from garlick/build_cleanup
mergify[bot] Nov 20, 2024
0f1a219
flux-job: save whole event in attach ctx->last_event
grondo Nov 14, 2024
ed1d474
flux-job: fix `flux job attach` statusline
grondo Nov 14, 2024
e8505d4
flux-job: attach status doesn't handle queue updates
grondo Nov 14, 2024
a6ea96e
flux-job: attach: improve output of exception with status line
grondo Nov 15, 2024
ac7b609
testsuite: fix skipped test in t2500-job-attach.t
grondo Nov 14, 2024
34deaff
testsuite: test `flux job attach` handling of nonfatal exception
grondo Nov 15, 2024
d00a449
Merge pull request #6442 from grondo/issue#6314
mergify[bot] Nov 21, 2024
a0137fe
use portable _POSIX_HOST_NAME_MAX
garlick Nov 21, 2024
27c18cb
handle missing link.h
garlick Nov 21, 2024
655c325
libutil: fix compile warnings in fdwalk on non-linux
garlick Nov 21, 2024
b55b735
libmissing: add pipe2
garlick Nov 21, 2024
d5b0ece
libmissing: add mempcpy
garlick Nov 21, 2024
7bf3bb3
build: add missing JANSSON_CFLAGS
garlick Nov 21, 2024
645ee7d
libutil: fix sigutil portability issues
garlick Nov 21, 2024
ec9d92d
Use EDEADLK when EDEADLOCK is unavailable
garlick Nov 21, 2024
7a358f9
build: handle epoll() via libepoll-shim
garlick Nov 21, 2024
6e3bc87
build: add scripts for macos
garlick Nov 21, 2024
01d3650
Merge pull request #6454 from garlick/macos
mergify[bot] Nov 21, 2024
51ef24c
flux-start: add common broker args function
grondo Nov 21, 2024
0644adb
flux-start: simplify argument addition in client_create()
grondo Nov 21, 2024
860a639
flux-start: fix whitespace issues
grondo Nov 21, 2024
0892032
flux-start: add prepend argument to add_args_list()
grondo Nov 21, 2024
90e8b19
flux-start: add `-S, --setattr=` option
grondo Nov 21, 2024
78f57cb
flux-start: add `-c, --config-path` option
grondo Nov 21, 2024
44ed42d
flux-start: fix argument truncation in add_argzf()
grondo Nov 21, 2024
7e89c3b
doc: document `-S` and `-c` in flux-start(1)
grondo Nov 21, 2024
89c2016
testsuite: update flux-start usage
grondo Nov 21, 2024
607c57e
Merge pull request #6452 from grondo/flux-start-newargs
mergify[bot] Nov 22, 2024
640631e
doc: improve `--include` documentation in flux-resource(1)
grondo Nov 22, 2024
af5feae
Merge pull request #6459 from grondo/resource-list-doc
mergify[bot] Nov 22, 2024
ab1f6ec
kvs-watch: fix indentation
chu11 Nov 6, 2024
d5d222f
kvs-watch: remove duplicate parse of root_seq
chu11 Nov 7, 2024
689ac7c
kvs-watch: add note on out of order lookups
chu11 Nov 13, 2024
fcfcbdc
kvs-watch: improve error messages
chu11 Nov 21, 2024
0047178
kvs-watch: avoid error response on failed rpc
chu11 Nov 21, 2024
8015f88
Merge pull request #6458 from chu11/kvs_watch_cleanup
mergify[bot] Nov 22, 2024
69dc453
libfileref: fix segfault for files >2G
garlick Nov 28, 2024
9c3bb07
Merge pull request #6462 from garlick/issue#6461
mergify[bot] Nov 28, 2024
8030324
shell: add executable name to doom exceptions
chu11 Nov 21, 2024
57f5124
flux-job: output jobid with exception
chu11 Nov 22, 2024
3acf1d8
Merge pull request #6453 from chu11/issue6357_doom_exception
mergify[bot] Dec 3, 2024
eb98dd9
cmd: add `--skip-empty` option to `flux resource list`
grondo Nov 22, 2024
5da7b44
cmd: skip empty lines with `flux resource list --include`
grondo Nov 22, 2024
3ba4a19
testsuite: test flux-resource skip-empty behavior
grondo Nov 22, 2024
19ded0f
etc: update bash completions for flux-resource
grondo Nov 22, 2024
c2e7203
doc: update flux-resource(1)
grondo Nov 22, 2024
d61eccb
Merge pull request #6460 from grondo/issue#6275
mergify[bot] Dec 3, 2024
54e4f19
libfluxutil: improve rusage query
garlick Nov 22, 2024
0bdd3d2
flux-module: add stats --rusage=[WHO]
garlick Dec 3, 2024
1137adb
flux-module(1): document stats --rusage=[WHO]
garlick Dec 3, 2024
e2a50f0
testsuite: cover flux module stats --rusage=who
garlick Dec 3, 2024
9f43e6b
Merge pull request #6471 from garlick/rusage
mergify[bot] Dec 4, 2024
2ea0d21
python: support `wait` parameter to jobid URI resolver
grondo Nov 14, 2024
d975fe1
doc: document `flux uri --wait`
grondo Nov 15, 2024
dabc2b9
testsuite: test `flux uri --wait JOBID`
grondo Nov 15, 2024
435756d
Merge pull request #6443 from grondo/job-uri-wait
mergify[bot] Dec 4, 2024
fd4dd44
libev: disable timerfd on macos
garlick Nov 22, 2024
328cd30
add missing AM_CPPFLAGS include directories
garlick Nov 22, 2024
3b90882
cast %j arguments to [u]intmax_t
garlick Nov 22, 2024
e9f00f5
liboptparse: update getopt non-gnu code
garlick Nov 22, 2024
9864628
libmissing/envz: fix includes
garlick Nov 22, 2024
f72c844
define environ as an extern
garlick Nov 22, 2024
6029a69
include signal.h for kill(2), killpg(2)
garlick Nov 22, 2024
4bc8d63
work around missing SOCK_CLOEXEC
garlick Nov 22, 2024
67f8663
librouter: use LOCAL_PEERCRED if SO_PEECRED is missing
garlick Nov 22, 2024
9161298
kvs: use EINVAL instead of EBADE if undefined
garlick Nov 22, 2024
27de281
Merge pull request #6468 from garlick/macos2
mergify[bot] Dec 4, 2024
7457a7d
NEWS.md: add release notes for v0.69.0
grondo Dec 3, 2024
11ad605
Merge pull request #6472 from grondo/news-0.69
mergify[bot] Dec 4, 2024
dd9bc1f
libmissing: add get_current_dir_name()
garlick Nov 22, 2024
cd539bb
handle missing prctl(2)
garlick Nov 22, 2024
454731e
handle pthread_setname_np() without tid argument
garlick Nov 22, 2024
cb97929
cast timeval.tv_usec to long
garlick Nov 22, 2024
16fe50a
skip GLOB_TILDE_CHECK if unavailable
garlick Nov 22, 2024
94ada76
support BSD ptrace request names
garlick Nov 22, 2024
78cb74d
skip unportable rlimit resource names
garlick Nov 22, 2024
bb935a7
avoid duplicate basename(3) calls
garlick Nov 22, 2024
27abc8c
add basename_simple()
garlick Nov 22, 2024
01f3097
Merge pull request #6473 from garlick/macos3
mergify[bot] Dec 4, 2024
ecfe518
drop caliper support
garlick Dec 4, 2024
861c5cf
Merge pull request #6475 from garlick/drop_caliper
mergify[bot] Dec 5, 2024
143a140
flux-hostlist: allow `-n, --nth` to take an idset
grondo Dec 5, 2024
fe38f27
flux-hostlist: allow `-x, --exclude` to take an idset
grondo Dec 5, 2024
78860bd
flux-hostlist: allow --nth to used with --expand and --count
grondo Dec 5, 2024
70c079f
testsuite: update flux-hostlist tests
grondo Dec 5, 2024
974dda0
doc: update flux-hostlist(1)
grondo Dec 5, 2024
f761aec
Merge pull request #6478 from grondo/flux-hostlist-tweaks
mergify[bot] Dec 5, 2024
fa38626
build: use libtool -export-symbols-regex
garlick Nov 22, 2024
407445c
build: use libtool -no-undefined
garlick Nov 22, 2024
3a87534
build: abstract -Wl,--gc-sections
garlick Nov 23, 2024
e530c61
build: fix epoll detection
garlick Nov 23, 2024
3eb3edf
build: define symbol instead of using --defsym
garlick Nov 23, 2024
d7937aa
build: link libflux-taskmap.so with libmissing
garlick Nov 23, 2024
2e9bcea
build: link content broker module with libarchive
garlick Nov 23, 2024
4a17c06
valgrind: suppress libuuid TLS leak
garlick Dec 4, 2024
72ae1f0
Merge pull request #6476 from garlick/macos4
mergify[bot] Dec 5, 2024
704fc9e
flux-job: fflush eventlog entries
chu11 Nov 14, 2024
4519ea9
libkvs: support treeobj_type_name function
chu11 Dec 4, 2024
111e397
kvs-watch: break long parameter lists
chu11 Dec 4, 2024
1cdcae8
kvs-watch: only fetch new data for appends
chu11 Nov 7, 2024
7f037f7
doc: update FLUX_KVS_WATCH_APPEND description
chu11 Nov 14, 2024
c39597d
t: update kvs watch append tests
chu11 Nov 14, 2024
acfad8d
t: add new kvs watch append tests
chu11 Nov 15, 2024
0cfb291
Merge pull request #6444 from chu11/issue6414_kvs_watch_optimization
mergify[bot] Dec 5, 2024
a288986
adjust whitespace in source code
garlick Nov 30, 2024
2e7b064
Merge pull request #6481 from garlick/cosmetic_cleanup
mergify[bot] Dec 5, 2024
ff292bf
shell/oom: skip building when inotify is unavailable
garlick Nov 23, 2024
8a9d1ad
flux: fix error message
garlick Nov 23, 2024
169cc01
libutil/intree: port to macos
garlick Nov 23, 2024
0ca8125
testsuite: allow fdwalk test to work on non-linux
garlick Nov 23, 2024
961c539
testsuite: avoid non-portable pthreads dependency
garlick Dec 3, 2024
3503560
libmissing: fix pipe2 implementation
garlick Dec 5, 2024
756c1ed
work around missing SOCK_NONBLOCK
garlick Dec 5, 2024
0872748
include signal.h for kill(2)
garlick Dec 5, 2024
044fd2f
define environ as an extern
garlick Dec 5, 2024
d653f20
use libmissing get_current_dir_name()
garlick Dec 5, 2024
e7d7bb3
build: add missing JANSSON_CFLAGS
garlick Dec 5, 2024
a941e1f
cast %j arguments to [u]intmax_t
garlick Dec 5, 2024
02cae96
Merge pull request #6479 from garlick/macos5
mergify[bot] Dec 5, 2024
3a383bb
resource: only read resource.scheduling config on rank 0
grondo Dec 5, 2024
01c7e5c
Merge pull request #6482 from grondo/issue#6480
mergify[bot] Dec 6, 2024
c46bdb6
shell: ignore SIGPIPE
grondo Dec 6, 2024
5fe12df
testsuite: add test for job shell handling of SIGPIPE
grondo Dec 6, 2024
7cd6705
Merge pull request #6489 from grondo/issue#6487
mergify[bot] Dec 6, 2024
ff1a93f
build: use -Wl,--gc-sections when appropriate
garlick Dec 10, 2024
3cae472
Merge pull request #6497 from garlick/issue#6496
mergify[bot] Dec 10, 2024
d51397b
testsuite: use NSIG in job manager kill test
garlick Dec 10, 2024
54291e9
testsuite: remap EBADE if undefined
garlick Dec 10, 2024
86b9dad
testsuite: define 'environ' for macos
garlick Dec 10, 2024
21c8e19
ci: add macos-12 build only test
garlick Dec 10, 2024
0af65c5
Merge pull request #6499 from garlick/macos_builder
mergify[bot] Dec 10, 2024
9beaab7
jobtap: add flux_jobtap_jobspec_update_id_pack()
grondo Dec 10, 2024
c5d9fda
testsuite: test flux_jobtap_jobspec_update_id_pack()
grondo Dec 10, 2024
7d4d63b
Merge pull request #6500 from grondo/issue#5957
mergify[bot] Dec 11, 2024
8c0399c
libflux: add "confdir" to flux_conf_builtin_get()
grondo Dec 6, 2024
23a7df7
python: add flux.conf_builtin.conf_builtin_get()
grondo Dec 6, 2024
024e5cb
testsuite: add python unit tests for conf_builtin_get()
grondo Dec 6, 2024
cb2b6f3
Merge pull request #6486 from grondo/python-conf-builtin
mergify[bot] Dec 11, 2024
60c8cf0
libflux: fix code formatting
garlick Dec 6, 2024
c39cb49
doc: update section 3 manual with size_t changes
garlick Dec 2, 2024
63551b4
libflux: use size_t for message/blob sizes
garlick Dec 6, 2024
b827295
libutil/blobref: use size_t not int
garlick Dec 6, 2024
e2a5286
libfilemap: use size_t not int
garlick Dec 6, 2024
af5e55d
libzmqutil: use size_t not int
garlick Dec 6, 2024
174a1d4
libutil/stdlog: use size_t not int
garlick Dec 6, 2024
8ee6a97
Merge pull request #6467 from garlick/use_size_t
grondo Dec 11, 2024
69aaf2a
job-info: fix parameter indentation
chu11 Dec 10, 2024
15aad44
job-info: avoid error response on failed rpc
chu11 Dec 12, 2024
fd8afd5
Merge pull request #6502 from chu11/issue6498_job_info_failed_rpc
mergify[bot] Dec 13, 2024
3cb37dd
python: avoid error in Jobspec.add_file when encoding is set
grondo Dec 13, 2024
1f4d61c
python: cli: move handling of `--add-file` option to function
grondo Dec 13, 2024
e3ea357
python: cli: fix handling of newline in `--add-file`
grondo Dec 13, 2024
ad07a9e
testsuite: test `flux submit --add-file=name=DATA`
grondo Dec 13, 2024
8f575da
Merge pull request #6504 from grondo/add-file-data
mergify[bot] Dec 13, 2024
doc: add TROUBLESHOOTING to flux-housekeeping(1)
Problem: some subtleties of flux-housekeeping are undocumented.

Document the fact that housekeeping runs by default under systemd
and provide some insight into how the system behaves when it fails
or hangs.
garlick committed Nov 12, 2024
commit 4adf007414eeca7ed4bd5a7e9a4dfc5853e16971
47 changes: 47 additions & 0 deletions doc/man1/flux-housekeeping.rst
@@ -31,6 +31,8 @@ housekeeping service. It supports listing the resources currently executing
housekeeping actions and a command to forcibly terminate actions on a per-job
or per-node basis.

In a Flux system instance, housekeeping is configured by default to run as a
one-shot :linux:man5:`systemd.unit`. See :ref:`troubleshooting` below.

COMMANDS
========
@@ -127,6 +129,51 @@ The following field names can be specified for
**pending.ranks**
   The list of nodes that still need to complete housekeeping.

.. _troubleshooting:

TROUBLESHOOTING
===============

In a Flux system instance, housekeeping is configured by default to run as a
:linux:man5:`systemd.unit` named ``flux-housekeeping@JOBID``.

:linux:man1:`systemctl` can show the status of housekeeping units running
on the local node::

  $ systemctl status flux-housekeeping@*

:linux:man1:`journalctl` shows standard output and error of a housekeeping
run::

  $ journalctl -u flux-housekeeping@f4aTGTz2SN3

When housekeeping fails, the systemd unit script drains the failing nodes
with the reason obtained from systemd. For example, housekeeping runs
that failed due to a nonzero exit code are distinguished from those that
were aborted early due to a signal. In addition, a failure message is
logged to Flux and can be accessed with :man1:`flux-dmesg`.
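
As a rough illustration of how these failure modes can be told apart, the sketch below reads the standard systemd ``Result`` property of a unit and maps it to a short description. The helper names ``describe_result`` and ``unit_result`` are hypothetical, and this is not the logic of the actual ``flux-housekeeping@`` unit script, only an assumption about how an administrator might inspect the same information by hand.

```shell
#!/bin/sh
# Hypothetical helper: translate a systemd "Result" value into a short
# description. This is NOT the code used by the real unit script.
describe_result() {
    case "$1" in
        success)   echo "housekeeping completed successfully" ;;
        exit-code) echo "housekeeping exited with a nonzero exit code" ;;
        signal)    echo "housekeeping was terminated by a signal" ;;
        timeout)   echo "housekeeping timed out" ;;
        *)         echo "housekeeping failed (result: $1)" ;;
    esac
}

# Hypothetical helper: query the Result property of one housekeeping unit.
# Requires systemd; $1 is the Flux job ID used in the unit name.
unit_result() {
    systemctl show -p Result --value "flux-housekeeping@$1"
}

# Demonstration with fixed inputs (no systemd needed).
describe_result exit-code
describe_result signal
```

On a node where the unit has run, something like ``describe_result "$(unit_result f4aTGTz2SN3)"`` would combine the two helpers.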

When housekeeping hangs, no automated action is taken by Flux. Sending
housekeeping a signal with :program:`flux housekeeping kill` causes
:program:`systemctl stop` to be run on the housekeeping unit. Generally,
it is best to let systemd take over from there. Its default action is to
send SIGTERM to all processes in the control group, then SIGKILL if any
processes have not terminated after a 90s delay.
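
To check which stop behavior applies to a particular housekeeping unit, the settings systemd uses can be queried directly. The sketch below is a minimal example under stated assumptions: ``hk_unit`` and ``hk_stop_settings`` are hypothetical helper names, while ``KillMode`` and ``TimeoutStopUSec`` are standard systemd unit properties. The values reported on a given system may differ from the defaults described above if the unit file overrides them.

```shell
#!/bin/sh
# Hypothetical helper: build the systemd unit name for a Flux job ID,
# following the flux-housekeeping@JOBID naming described above.
hk_unit() {
    printf 'flux-housekeeping@%s\n' "$1"
}

# Hypothetical helper: show the settings that govern how systemd stops
# the unit. Requires systemd; $1 is the Flux job ID.
hk_stop_settings() {
    systemctl show -p KillMode -p TimeoutStopUSec "$(hk_unit "$1")"
}

# Demonstration of the naming helper (no systemd needed).
hk_unit f4aTGTz2SN3
```

Running ``hk_stop_settings f4aTGTz2SN3`` on the affected node would then print the kill mode and stop timeout in effect for that unit.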

.. note::

  On systems with scheduler configurations that permit jobs to share nodes,
  multiple housekeeping units may execute concurrently on a single node.
  Housekeeping scripts must be crafted with that in mind on such systems.

CAVEATS
=======

The ``flux-housekeeping@`` systemd unit is responsible for draining nodes
when housekeeping fails. Therefore, if the system is configured to bypass
the systemd unit file, or if housekeeping is misconfigured such that the
systemd unit file is not started, this draining does not occur.

RESOURCES
=========

2 changes: 2 additions & 0 deletions doc/test/spell.en.pws
@@ -932,3 +932,5 @@ minnodes
cgroup
cgroups
fs
misconfigured
aTGTz