Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[pull] master from flux-framework:master #38

Open
wants to merge 9,562 commits into
base: master
Choose a base branch
from

Conversation

pull[bot]
Copy link

@pull pull bot commented Mar 13, 2020

See Commits and Changes for more details.


Created by pull[bot]. Want to support this open source service? Please star it : )

@pull pull bot added the ⤵️ pull label Mar 13, 2020
@ghost
Copy link

ghost commented Nov 5, 2020

DeepCode's analysis on #fde90d found:

  • ⚠️ 5 warnings, ℹ️ 17 minor issues. 👇
  • ✔️ 15 issues were fixed.

Top issues

Description Example fixes
Using sprintf can lead to severe buffer overflow vulnerabilities. Use the safe alternative snprintf instead. Occurrences: 🔧 Example fixes
Potential nullptr dereference. 0 may return a nullptr if calloc failed to allocate memory. Consider adding a check for nullness. Occurrences: 🔧 Example fixes
The result of calloc, which may return null flows to the first argument of memcpy. This could result in undefined behavior. Consider adding a check for nullness. Occurrences: 🔧 Example fixes

👉 View analysis in DeepCode’s Dashboard | Configure the bot

@ghost
Copy link

ghost commented Nov 19, 2020

DeepCode's analysis on #b3b81a found:

  • ⚠️ 5 warnings, ℹ️ 17 minor issues. 👇
  • ✔️ 15 issues were fixed.

Top issues

Description Example fixes
Using sprintf can lead to severe buffer overflow vulnerabilities. Use the safe alternative snprintf instead. Occurrences: 🔧 Example fixes
Potential nullptr dereference. 0 may return a nullptr if calloc failed to allocate memory. Consider adding a check for nullness. Occurrences: 🔧 Example fixes
The result of calloc, which may return null flows to the first argument of memcpy. This could result in undefined behavior. Consider adding a check for nullness. Occurrences: 🔧 Example fixes

👉 View analysis in DeepCode’s Dashboard | Configure the bot

@codecov
Copy link

codecov bot commented Feb 16, 2023

Codecov Report

Merging #38 (1f44068) into master (a9edafa) will increase coverage by 1.69%.
The diff coverage is n/a.

❗ Current head 1f44068 differs from pull request most recent head 15fdc9c. Consider uploading reports for the commit 15fdc9c to get more accurate results

@@            Coverage Diff             @@
##           master      #38      +/-   ##
==========================================
+ Coverage   81.28%   82.98%   +1.69%     
==========================================
  Files         247      428     +181     
  Lines       38262    75675   +37413     
==========================================
+ Hits        31103    62800   +31697     
- Misses       7159    12875    +5716     
Impacted Files Coverage Δ
src/shell/events.c 66.38% <0.00%> (-24.92%) ⬇️
src/common/libschedutil/init.c 66.66% <0.00%> (-19.70%) ⬇️
src/common/libpmi/simple_client.c 73.03% <0.00%> (-16.59%) ⬇️
src/modules/job-manager/kill.c 77.14% <0.00%> (-14.97%) ⬇️
src/broker/publisher.c 66.45% <0.00%> (-13.75%) ⬇️
src/common/libsubprocess/server.c 59.48% <0.00%> (-11.51%) ⬇️
src/connectors/ssh/ssh.c 84.17% <0.00%> (-9.53%) ⬇️
src/common/libflux/attr.c 83.87% <0.00%> (-9.47%) ⬇️
src/connectors/shmem/shmem.c 75.00% <0.00%> (-9.22%) ⬇️
src/shell/rcalc.c 78.78% <0.00%> (-8.46%) ⬇️
... and 450 more

garlick and others added 26 commits October 31, 2024 06:58
Problem: the sdexec code references 'flux imp kill' in an expanatory
comment, but we plan to remove the kill subcommand from the IMP and
have it forward signals to the cgroup.

Drop that comment.
Problem: sdexec_start_transient_unit() does not support the SendSIGKILL
attribute, which may need to be set to "off" for Flux guest jobs.

Add support for the SendSIGKILL atttribute.

To use, set the subprocess command option "SDEXEC_PROP_SendSIGKILL"
to "off" (or other valid systemd boolean config value).
Problem: there is no test coverage that confirms sdexec
can set KillMode and SendSIGKILL.

Add some tests.
Problem: For multi-user jobs spawned via SDEXEC, the systemd user
instance running as the flux user does not have permission to kill
guest processes, yet it does try and in the process may kill off
the only process that does have permission to continue cleanup efforts,
the IMP.

When the job is run by the IMP and sdexec, Set KillMode=process so that
systemd only delivers signals to the IMP, which it should forward
to the shell and/or cgroup per RFC 15.

Also set SendSIGKILL to "off" so that SIGKILL is never deployed against
the IMP.

Fixes #6399
sdexec: set KillMode=process SendSIGKILL=no for multi-user jobs
Problem: There is a small chance of receiving EADDRINUSE from
an overlay_bind() call if a prior test has not fully cleaned itself
up.

Destroy and recreate the zeroMQ context to ensure things are cleaned
up before continuing on to additional tests.

Fixes #6404
    broker/test: avoid race in zeromq cleanup
Problem: The `defaults` job frobnicator plugin fails to override
global defaults with queue defaults because the global default is
applied to the jobspec before the queue defaults are examined, and
since a value is already set in jobspec, the queue default is ignored.

Instead of applying global defaults before examining queue defaults,
update the `queue_defaults()` method to take a defaults dict and
update it with queue specific defaults. Then the combined defaults
are applied in one pass, avoiding the issue above.

Fixes #6400
Problem: A queue-specific default that overrides a global default is
not tested in the job frobnicator tests.

Add a test to t2112-job-ingest-frobnicator.t that ensures global
defaults can be overridden by queue defaults.
frobnicator: allow queue-specific defaults to override global defaults
Problem: One test in t0005-exec.t uses spaces instead of a tab
for indentation.

Fix tabbing for consistency.
Problem: In the near future additional code flux-exec will use the
zlistx_t data structure.  Some other code in flux-exec already uses
the zhashx_t data structure.  One lingering data structure still
uses the zlist_t data structure.

For consistency going forward, convert the lingering use of zlist_t
to zlistx_t.
Problem: libsubprocess now supports stdin flow control via credits,
but that is not used in flux-exec.

Support credits and flow control in flux-exec to avoid overflowing
the stdin buffer.

Fixes #4572
Problem: sdexec does not yet support flow control

Disable flow control if the service is set to sdexec.
Add a --stdin-flow=on|off hidden option to force it either way.
Problem: There is no coverage to ensure that stdin flow control
truly works with flux-exec.

Add a test to t0005-exec.t that sends a 5 meg file while only using
a 4K buffer.  A buffer overflow would almost certainly happen
if flow control was not working.
flux-exec: use subprocess credits to avoid overflowing stdin buffers
Problem: When the a prolog is canceled, but the cancel times out, no
job exception is raised and the job proceeds to start. No only this,
but the epilog can end up being run twice since there's an assumption
in the code that there will be no finish event.

Add a new flag to the perilog_proc struct to capture when a canceled
prolog times out. Emit a specific exception in emit_finish_event()
when this flag is set and force the prolog exit status to 1.
Problem: An error message is logged using `flux_log_error(3)` when
the prolog times out, but there is no errno to decode, so the log
message always displays `Success`.

Use `flux_log(3)` with `LOG_ERR` instead of `flux_log_error(3)`.
Problem: No tests check that the perilog plugin handles the case
where a job prolog is canceled and the cancel times out.

Add a test to t2274-manager-perilog-per-rank.t that cancels a prolog
that ignores SIGTERM and ensure the job does not start.
resolve perilog plugin issue that lets a job start after prolog timeout when cancellation fails
Problem: libsubprocess duplicates the child ends of channel socketpairs
onto stdin, stdout, stderr in the child, but leave the original copies
open.

Exclude those channels from the idset generated by subprocess_childfds().
Both fork.c and posix_spawn.c already include logic to close descriptors
that do not appear in this idset after stdio is duplicated.

Fixes #6415
Problem: there are no tests that check for leaked file descriptors
in subprocesses.

Add test.
libsubprocess: close extra file descriptors
Problem: flux core now requires the IMP signal forwarding features
of flux-security 0.13.0, but configure only checks for >= 0.9.0.

Modify configure to require >= 0.13.0.
Problem: flux core now requires the IMP signal forwarding features
of flux-security 0.13.0, but CI specifies requires 0.11.0.

Modify docker-run-checks.sh to require the newer version.
Problem: job-exec uses 'flux imp kill' to deliver signals to multi-user
jobs but that command is deprecated.

Don't call bulk_exec_set_imp_path().
garlick and others added 30 commits December 5, 2024 14:58
Problem: compilation of test program fails due to missing jansson headers.

Add $(JANSSON_CFLAGS) to AM_CPPFLAGS.
Problem: compilation fails on macos when %ju or %jd is used with
non-matching integer types.

Cast them to [u]intmax_t.
fix macos portability issues (fifth batch)
Problem: The Rv1 scheduling key as JGF can be large and time-consuming
to generate and distribute, but currently it is required on all
nodes. Since the key is really only used by the scheduler on rank 0,
it would be ideal if resource.scheduling was only read on rank 0.

Only process the resource.scheduling config key on rank 0.

Fixes #6480
resource: only read `resource.scheduling` config on rank 0
Problem: If a task closes stdin and the job shell tries to write
data to it, the shell is immediately terminated with SIGPIPE, but it
would be better to handle an EPIPE error instead of terminating the
entire job.

Block SIGPIPE in the job shell.

Fixes #6487
Problem: There's no test to ensure the job shell is not killed by
SIGPIPE when writing data to a task that has closed stdin.

Attempt to add a test for this issue. Add a note that the test is
inherently racy, and it is just a best effort at this time.
Problem: libflux-optparse.so has unresolved jansson symbols on focal.

A recent macos portability change dropped the linker option to
garbage collect unused sections, resulting in unresolved json_*
symbols in libflux-optparse.so from libmissing.la on focal.

The change, introduced in #6476, was supposed
to set a make variable $ld_gc_sections when that option is available,
but the change was incomplete and the variable was never set.

Add the missing configure logic.

Confirmed:
- json_ symbols are no longer unresolved in libflux-optparse.so on focal
- macos still builds, and json_ symbols do not appear there

Fixes #6496
build: use -Wl,--gc-sections when appropriate
Problem: a unit test uses SIGRTMAX + 1 as an invalid signal,
but SIGRTMAX is not defined on macos.

Use NSIG instead.  Although it is not standardized, it works
on both linux and macos.
Problem: EBADE was remapped to EINVAL on macos in the KVS but
a unit test is still looking for it.

Remap there as well.
Problem: a subprocess unit test fails to compile on macos
because 'environ' is undefined.

Define it as an extern, as has been done elsewhere.
Problem: flux-core has been partially ported to macos,
to the point of building but not running, but there is no
test to notify of regressions.

Add a simple build test to the CI workflow.

Fixes #6484
Problem: There's no interface to update the jobspec of a job
outside of a jobtap callback for that job.

Add flux_jobtap_jobspec_update_id_pack(). This function just does
some nominal checks on the target job, then wraps event_post_pack().
Problem: The jobtap API call flux_jobtap_jobspec_update_id_pack()
is not tested.

Add it to the testing in t2212-job-manager-plugins.t.
jobtap: add `flux_jobtap_jobspec_update_id_pack()`
Problem: It would be useful to query the builtin intree or installed
configuration path from libflux, but flux_conf_builtin_get() does
not support that specific value.

Add confdir to the list of supported keys for flux_conf_builtin_get().
Problem: Python programs do not have access to libflux builtin
configuration values via flux_conf_builtin_get().

Add flux.conf_builtin.conf_builtin_get() which provides access to this
function from Python. Unfortunately, the bahavior of FLUX_CONF_AUTO
must be duplicated because the builtin libflux intree vs installed
tests do not work when `python` is the executable (since it isn't
part of Flux).
Problem: There are no unit tests for the Python interface to
flux_conf_builtin_get().

Add some very simple tests.
add `flux.conf_builtin.conf_builtin_get()` to give Python access to compiled-in config values
Problem: some code does not conform to RFC 7 and project norms.

Break long parameter lists to one per line.
Format pack/unpack with key-value pairs on the same line.
Problem: some public functions have new footprints but the
section 3 manual does not reflect this.

Update manual.
Problem: some libflux interfaces use the C int type to represent
buffer and I/O sizes, but size_t, as an unsigned type with more
range, is more appropriate in public interfaces.

Convert the following public API interfaces to use size_t:

  flux_msg_get_payload()
  flux_msg_set_payload()
  flux_request_decode_raw()
  flux_request_encode_raw()
  flux_response_decode_raw()
  flux_response_encode_raw()
  flux_respond_raw()
  flux_event_encode_raw()
  flux_event_decode_raw()
  flux_event_publish_raw()
  flux_rpc_raw()
  flux_rpc_get_raw()
  flux_kvs_lookup_get_raw()
  flux_kvs_txn_put_raw()

  Update tests and in-tree users, where appropriate.
Problem: several functions in the internal blobref class use int
types where size_t would be more appropriate.

Switch to size_t.
Problem: several functions in the internal filemap and fileref classes
use int types where size_t would be more appropriate.

Switch to size_t.
Update unit test.
Problem: a function in the internal mpart class uses an int type where
size_t would be more appropriate.

Switch to size_t.
Problem: several functions in the internal stdlog class use int
types where size_t would be more appropriate.

Switch to size_t.
Update unit test.
Update users.
libflux: update API to use size_t where appropriate
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants