Turbo daemon creates / leaves a ton of <defunct> processes, accumulating enough sometimes to breach the OS-wide process limit, preventing the creation of any new processes. #9455

Closed
1 task done
NullVoxPopuli opened this issue Nov 18, 2024 · 8 comments
Labels
kind: bug Something isn't working

Comments

@NullVoxPopuli

NullVoxPopuli commented Nov 18, 2024

Verify canary release

  • I verified that the issue exists in the latest Turborepo canary release.

Link to code that reproduces this issue

I think: all turbo projects running turbo while in interactive-rebase.

This is a pretty bad bug, because macOS only has a limit of ~5600 processes, and once you hit that, you can't spawn terminals, open apps, create new browser tabs, or even run ps.

You need to already have Activity Monitor (or similar) open so that you can kill the turbo daemon process. Otherwise you may be forced to reboot.

Which canary version will you have in your reproduction?

2.3.1-canary.0

Environment information

❯ pnpm turbo info
turbo 2.3.1-canary.0

CLI:
   Version: 2.3.1-canary.0
   Path to executable: <.pnpm>/[email protected]/node_modules/turbo-darwin-arm64/bin/turbo
   Daemon status: Running
   Package manager: pnpm9

Platform:
   Architecture: aarch64
   Operating system: macos
   WSL: false
   Available memory (MB): 10455
   Available CPU cores: 12

Environment:
   CI: None
   Terminal (TERM): alacritty
   Terminal program (TERM_PROGRAM): unknown
   Terminal program version (TERM_PROGRAM_VERSION): unknown
   Shell (SHELL): /opt/homebrew/Cellar/bash/5.2.32/bin/bash
   stdin: false

Setup, check processes:

ps -ef | grep defunct | wc -l
# 1 or 2

Normally, a machine should be running fewer than ~1000 processes:

ps -ef | wc -l
# I usually hover around 600 to 800
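
Note that `ps -ef | grep defunct` can also match the `grep` process itself, which may be why an idle machine reports 1 or 2. Below is a minimal sketch of a stricter count that looks at the process state column instead. It assumes a `ps` that accepts `-A -o stat=` (macOS and Linux procps both should), and `count_zombies` is a hypothetical helper name, not something turbo provides.

```shell
# Count processes whose state starts with "Z" (zombie, shown as <defunct>).
# Reads one process state per line on stdin and prints a single number.
count_zombies() {
  awk '$1 ~ /^Z/ { n++ } END { print n + 0 }'
}

# Usage:
#   ps -A -o stat= | count_zombies
```

Reading the states from stdin keeps the helper independent of which `ps` flags a given platform needs.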

Scenario A (inconsistent)

  • be in interactive rebase
    (I'm splitting commits into more commits)
  • have prepare or postinstall trigger turbo's build
  • run turbo again (maybe for lint, or whatever)

Scenario B (inconsistent)

  • after changing a dependency of a package

Test:

ps -ef | grep defunct | wc -l
# 807

Test after upgrading to latest canary (noting that we run build in postinstall):

❯ ps -ef | grep defunct | wc -l
#    1435

I have an ongoing monitor for this running every second in a terminal that I just leave up all the time.

❯ watch -n 1 "echo \"All: \$(ps -ef | wc -l), Defunct: \$(ps -ef | grep defunct | wc -l)\""

And with pstree we can see that these all come from turbo

# get a list of all unique parent processes for each defunct process
❯ ps -ef | grep defunct | awk '{print $3}' | sort -u

# pass each of these to pstree
while IFS= read -r pid; do
    pstree -p "$pid"
done < <(ps -ef | grep defunct | awk '{print $3}' | sort -u)

Which will print something like this:

-+= 00001 root /sbin/launchd
 \-+= 11557 $USER /opt/homebrew/opt/borders/bin/borders
   \--- 11558 $USER <defunct>
-+= 00001 root /sbin/launchd
 \-+= 43271 $USER <.pnpm>/[email protected]/node_modules/turbo-darwin-arm64/bin/turbo --skip-infer daemon
   |--- 43359 $USER <defunct>
   |--- 43361 $USER <defunct>
   # and many hundreds more
   \--- 57042 $USER <defunct>
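
When there are several parents, a per-parent tally can be quicker to read than the full tree. This is a rough sketch, assuming `ps -A -o ppid= -o stat=` prints parent pid and state as two columns. `zombies_by_parent` is a hypothetical helper name.

```shell
# Summarise zombie processes per parent PID.
# Reads "PPID STATE" pairs on stdin and prints "COUNT PPID" lines,
# with the parent holding the most zombies first.
zombies_by_parent() {
  awk '$2 ~ /^Z/ { c[$1]++ } END { for (p in c) print c[p], p }' | sort -rn
}

# Usage:
#   ps -A -o ppid= -o stat= | zombies_by_parent
```

The pid at the top of the output is the one to pass to `pstree -p`.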

Expected behavior

No defunct processes should ever exist, as the OS will not clean these up on its own.

Actual behavior

Defunct processes are left lying around.

To Reproduce

It's possible this is reproducible in these OSS repos:

I somewhat regularly have to kill the top-level turbo daemon on Linux due to CPU usage. It's possible that this has the same root cause as the behavior I'm reporting here for macOS.

In both cases, Linux (where I do most of my OSS work) and macOS (where I do my closed-source, employer-owned work), killing the turbo daemon processes immediately makes my machines happier: it cleans up the defunct processes (macOS) or frees up CPU cycles (Linux).

Additional context

No response

@NullVoxPopuli NullVoxPopuli added kind: bug Something isn't working needs: triage New issues get this label. Remove it after triage labels Nov 18, 2024
@wagenet

wagenet commented Nov 18, 2024

We've seen this on other developer machines at my company as well.

@chris-olszewski
Member

If either of you could share daemon logs (turbo daemon status should display the logfile) that would be helpful. We should not be spawning child processes from the daemon.

@chris-olszewski chris-olszewski removed the needs: triage New issues get this label. Remove it after triage label Nov 18, 2024
@NullVoxPopuli
Author

NullVoxPopuli commented Nov 19, 2024

Here is what I got:

❯ pnpm turbo daemon status
# ...
✓ daemon is running
log file: <repo>/.turbo/daemon/e224a4a441d772ef-turbo.log.2024-11-19
uptime: 16m 6s 566mss
pid file: /var/folders/wk/w99lck4x7_5930c7gj65r3s40000gp/T/turbod/e224a4a441d772ef/turbod.pid
socket file: /var/folders/wk/w99lck4x7_5930c7gj65r3s40000gp/T/turbod/e224a4a441d772ef/turbod.sock
ope, big file

there is a lot of text

There was a problem saving your comment. 
Your comment is too long (maximum is 65536 characters). 
Please try again.

oops 🙈

here is a file tho

output.txt

as I was poking around in here, I noticed there was a lot of activity from watchman cookies.

@NullVoxPopuli
Author

It seems this is happening nearly daily for me -- can't really pinpoint what is causing the defunct processes to show up. In Activity Monitor, I do occasionally see > 20 git processes spawn, and then go away -- maybe related? idk.

@anthonyshew anthonyshew changed the title 🐛 Bug: Turbo daemon creates / leaves a ton of <defunct> processes, accumulating enough sometimes to breach the OS-wide process limit, preventing the creation of any new processes. Turbo daemon creates / leaves a ton of <defunct> processes, accumulating enough sometimes to breach the OS-wide process limit, preventing the creation of any new processes. Dec 2, 2024
@NullVoxPopuli
Author

We are trying setting https://turbo.build/repo/docs/reference/configuration#daemon to false for the time being. 🤞

chris-olszewski added a commit that referenced this issue Dec 5, 2024
### Description

In the case of an error when parsing `git` output, we would drop a
`Child` without `wait`ing on it, which results in a zombie process as the
pid is never reaped.

From [Rust
docs](https://doc.rust-lang.org/std/process/struct.Child.html#warning)

> On some systems, calling
[wait](https://doc.rust-lang.org/std/process/struct.Child.html#method.wait)
or similar is necessary for the OS to release resources. A process that
terminated but has not been waited on is still around as a “zombie”.
Leaving too many zombies around may exhaust global resources (for
example process IDs).

> The standard library does not automatically wait on child processes
(not even if the Child is dropped), it is up to the application
developer to do so. As a consequence, dropping Child handles without
waiting on them first is not recommended in long-running applications.

When there was a parse error we would `kill` the child process, but
never reap the pid. This PR ensures we make a best effort to do just
that. The way I'm calling wait is probably overkill, but I wanted to
ensure that we don't introduce any accidental waiting on a process that
didn't receive the kill signal.

Sources for comments:
 - [unix](https://man7.org/linux/man-pages/man2/kill.2.html)
 - [windows](https://learn.microsoft.com/en-us/windows/win32/api/processthreadsapi/nf-processthreadsapi-terminateprocess#return-value)

### Testing Instructions

I have done some manual confirmation that this works for a command like
`bash -c "sleep 100"` where it will

Hoping to get someone from
#9455 to test this out in a
canary and confirm this helps.
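
The actual fix lives in turbo's Rust code and is not shown in this thread. As a shell-level analogy only, the same two steps (signal the child, then wait on it so its pid is reaped) might look like the sketch below, where `sleep 100` stands in for the `git` child.

```shell
# Start a long-running child in the background and remember its pid.
sleep 100 &
child=$!

# Ask the child to terminate (kill sends SIGTERM by default).
kill "$child"

# Reap it: wait collects the exit status, which lets the kernel
# release the child's process table entry.
status=0
wait "$child" || status=$?

echo "child $child reaped with status $status"
```

A shell normally reaps its own background jobs, so the leak described in the commit does not occur here. The sketch only shows the ordering of the two calls.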
@chris-olszewski
Member

> It seems this is happening nearly daily for me -- can't really pinpoint what is causing the defunct processes to show up. In Activity Monitor, I do occasionally see > 20 git processes spawn, and then go away -- maybe related? idk.

Thank you so much for this comment! I didn't realize the daemon shelled out to git and there was in fact a bug where those git processes weren't getting reaped.

With #9564, we should now be correctly reaping child git processes. It is being released in 2.3.4-canary.1, which I will cut today. I would greatly appreciate it if you could test it out.

@NullVoxPopuli
Author

Thanks, @chris-olszewski !

I've tested with 2.3.4-canary.2
and had a control as well to verify defunct processes were still getting created (yay git worktrees!)

So far, I've not seen any defunct processes spawn from the canary.2

pending

Process

  1. I'm watching total process count vs defunct count via

    watch -n 1 "echo \"All: \$(ps -ef | wc -l), Defunct: \$(ps -ef | grep defunct | wc -l)\""

    looks like this: [screenshot of the watch output]

  2. The command I'm running is pnpm build --no-cache --force so turbo actually does stuff 😉 (too efficient otherwise!). We have a wrapper CLI that mixes in some environment variables, flags, and handles whether or not to reach out to the remote cache with a custom AWS S3 SSO

    # We use a _: prefix because we need to define "build" in the package.json,
    # but also want `build` in each package to go through turbo
    pnpm turbo --color --no-update-notifier \
      --env-mode=loose --summarize=true --output-logs=new-only \
      _:build \
      --filter=./libraries/**/* --no-cache --force
  3. In my two branches, I've removed "daemon": false from the turbo.json at the root of the repo

  4. I'm running the pnpm build --no-cache --force command 4 times to make sure behavior is consistent. Each time I run it, I make note of the total processes before and after, as well as defunct processes.

  5. Starting with a fresh rebase on the main branch so I don't have any local caches, deleted node_modules, etc

    # once 
    killall turbo
    git fetch origin
    
    # each branch / worktree
    git rebase origin/default-branch-name
    nuke # local recursive clean script here:  https://github.com/NullVoxPopuli/dotfiles/blob/323173c6042882a17079bccca7149985038dd1b6/home/scripts/bash-support/aliases.sh#L8
    pnpm install # runs an initial build via postinstall
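
The before/after bookkeeping from step 4 could be scripted along these lines. This is a sketch under assumptions: `snapshot` and `measure` are hypothetical helper names, `PS_CMD` is a made-up override hook for swapping the process listing, and the default `ps -A -o stat=` is assumed to print one state per process. It relies on bash or POSIX sh word splitting, so it will not work unmodified in zsh.

```shell
# Print "TOTAL ZOMBIES" for the current process table.
# PS_CMD (hypothetical hook) can name another command that prints
# one process state per line; it defaults to a real ps call.
snapshot() {
  ${PS_CMD:-ps -A -o stat=} | awk '{ t++ } $1 ~ /^Z/ { z++ } END { print t + 0, z + 0 }'
}

# Run the given command and report counts as [start, end] pairs.
measure() {
  before=$(snapshot)
  "$@"
  after=$(snapshot)
  echo "All: [${before% *}, ${after% *}] Defunct: [${before#* }, ${after#* }]"
}

# Usage:
#   measure pnpm build --no-cache --force
```

The totals will be one lower than `ps -ef | wc -l`, because no header line is counted.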

Results

Baseline env

| All Processes | Defunct Processes |
| --- | --- |
| 659 | 6 |

Note that total process count will fluctuate a bit, because the OS does do things. 🙈

The following tables use the format [starting process count, ending process count].
For example, [659, 656] would mean that we started with 659 total processes before I ran the build command and ended up with 656 afterwards.

control branch with turbo @ 2.3.3

expected outcome: defunct processes spawn

| Run | All Processes | Defunct Processes |
| --- | --- | --- |
| 1 | [657, ] | [6, ] |
| 1 | [657, ] | [6, ] |
| 1 | [657, ] | [6, ] |
| 1 | [657, ] | [6, ] |

branch with turbo @ 2.3.4-canary.2

expected outcome: defunct process count does not grow at all, for the entirety of the duration of the command

| Run | All Processes | Defunct Processes |
| --- | --- | --- |
| 1 | [657, ] | [6, ] |
| 1 | [657, ] | [6, ] |
| 1 | [657, ] | [6, ] |
| 1 | [657, ] | [6, ] |

I need to wait for one of my worktrees to reproduce the issue before I collect data.

Been trying to re-create the situation manually, but it's clear I still don't know the right order of operations to reproduce the defunct spawning problem.

@anthonyshew
Contributor

Awesome work, folks. Thank you, @NullVoxPopuli, for your thoroughness.
