Skip to content

Commit

Permalink
[IMP] core: mitigate possible deadlock on module installation
Browse files Browse the repository at this point in the history
When installing some modules, if an other action is undertaken around
the same time it is possible for the two to deadlock, leading to the
two workers being killed by the wallclock limit watcher. This should
only be an issue on the *threaded* server, workers should not be
affected, meaning the issue is not reproducible on runbot.

A relatively reliable way to trigger this issue manually is to create
an empty database, on the "apps" kanban view install the "sales"
application, and as soon as the UI gets unblocked install the "CRM"
application. An other method (also reliable but not really doable by
hand) is to simultanously log in and install the sales
application (`sale_management` module).

The core of the issue seems to be in `_button_immediate_function`:

1. `Other` has an env ready for use and has accessed various
   models (so has pretty shallow locks on the tables e.g.
   ACCESS SHARE).
2. `Install` takes the registry lock to create the new registry.
3. `Install` needs to update one of the tables `other` has touched,
   starts waiting on the `ACCESS EXCLUSIVE` lock (the only one
   `ACCESS SHARE` conflicts with) in order to execute DDL (most `ALTER
   TABLE` forms require exclusive access to the table).
4. `Other` needs a new environment (e.g. `sudo()`, `with_user`,
   `with_context`, ...), starts waiting on the registry lock.

At this point the two threads are deadlocked, `other` waits on the
registry lock which `install` holds, while `install` waits on a table
lock which `other` holds. Since one of the waits is on the application
side, Postgres' deadlock detector can not notice the issue. That one
of the locks is on the Python side is why only the threaded
server *should* be affected.

An initial seemingly promising mitigation attempt was to

    LOCK res_partner IN ACCESS EXCLUSIVE MODE

in the prelude of `_button_immediate_function` as `res.partner` is one
of the most commonly modified models, this would force
`_button_immediate_function` to wait until all existing requests have
completed and prevent later requests from progressing.

This turns out to be unreliable, as later requests could already have
acquired an environment and would race ahead as soon as the
transaction is committed if the scheduler lets them. Trying to lock
the registries earlier doesn't work as the locking is interleaved in
normal operation and we'd just deadlock there. The commit is because
`load_modules` does not take an externally provided cursor and instead
creates its own (thus its own connection and transaction). And because
of its lack of atomicity the issue might occur regardless.

An alternate mitigation is instead to set (or drastically reduce) the
lock wait delay during module installation, installation should
normally be entirely uncontended (or infeasible in production with
large traffic) so there is limited reason it'd be waiting several
seconds on a lock. Conveniently, this means instead of the thread
being killed entirely, the install request gets aborted *and retried*,
so it can succeed a little more slowly if that allows the other
request to complete and no other concurrent request causes the same
issue.

Other alternate mitigation which got discarded: reusing registries
when creating new environments (from existing ones) if the database is
the same, that works for some case of switching environments, it
doesn't work for other where we actually fetch a registry e.g. assets
generation calls `get_modules_order` which calls `module_boot` which
calls `module_installed_bypass_session` which gets a
registry. `get_modules_order` is the last place we know we have an
existing registry. Though maybe we could strip out the entire thing
and call `module_installed(self.env)` directly?

Issue 2581648

closes odoo#73866

X-original-commit: ab84d97
Signed-off-by: Xavier Morel (xmo) <[email protected]>
  • Loading branch information
xmo-odoo committed Jul 16, 2021
1 parent 3007fb8 commit 5d1674f
Showing 1 changed file with 5 additions and 0 deletions.
5 changes: 5 additions & 0 deletions odoo/modules/loading.py
Original file line number Diff line number Diff line change
Expand Up @@ -363,6 +363,11 @@ def load_modules(db, force_demo=False, status=None, update_module=False):
models_to_check = set()

with db.cursor() as cr:
# prevent endless wait for locks on schema changes (during online
# installs) if a concurrent transaction has accessed the table;
# connection settings are automatically reset when the connection is
# borrowed from the pool
cr.execute("SET SESSION lock_timeout = '15s'")
if not odoo.modules.db.is_initialized(cr):
if not update_module:
_logger.error("Database %s not initialized, you can force it with `-i base`", cr.dbname)
Expand Down

0 comments on commit 5d1674f

Please sign in to comment.