Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
[IMP] core: mitigate possible deadlock on module installation
When installing some modules, if an other action is undertaken around the same time it is possible for the two to deadlock, leading to the two workers being killed by the wallclock limit watcher. This should only be an issue on the *threaded* server, workers should not be affected, meaning the issue is not reproducible on runbot. A relatively reliable way to trigger this issue manually is to create an empty database, on the "apps" kanban view install the "sales" application, and as soon as the UI gets unblocked install the "CRM" application. An other method (also reliable but not really doable by hand) is to simultanously log in and install the sales application (`sale_management` module). The core of the issue seems to be in `_button_immediate_function`: 1. `Other` has an env ready for use and has accessed various models (so has pretty shallow locks on the tables e.g. ACCESS SHARE). 2. `Install` takes the registry lock to create the new registry. 3. `Install` needs to update one of the tables `other` has touched, starts waiting on the `ACCESS EXCLUSIVE` lock (the only one `ACCESS SHARE` conflicts with) in order to execute DDL (most `ALTER TABLE` forms require exclusive access to the table). 4. `Other` needs a new environment (e.g. `sudo()`, `with_user`, `with_context`, ...), starts waiting on the registry lock. At this point the two threads are deadlocked, `other` waits on the registry lock which `install` holds, while `install` waits on a table lock which `other` holds. Since one of the waits is on the application side, Postgres' deadlock detector can not notice the issue. That one of the locks is on the Python side is why only the threaded server *should* be affected. An initial seemingly promising mitigation attempt was to LOCK res_partner IN ACCESS EXCLUSIVE MODE in the prelude of `_button_immediate_function` as `res.partner` is one of the most commonly modified models, this would force `_button_immediate_function` to wait until all existing requests have completed and prevent later requests from progressing. This turns out to be unreliable, as later requests could already have acquired an environment and would race ahead as soon as the transaction is committed if the scheduler lets them. Trying to lock the registries earlier doesn't work as the locking is interleaved in normal operation and we'd just deadlock there. The commit is because `load_modules` does not take an externally provided cursor and instead creates its own (thus its own connection and transaction). And because of its lack of atomicity the issue might occur regardless. An alternate mitigation is instead to set (or drastically reduce) the lock wait delay during module installation, installation should normally be entirely uncontended (or infeasible in production with large traffic) so there is limited reason it'd be waiting several seconds on a lock. Conveniently, this means instead of the thread being killed entirely, the install request gets aborted *and retried*, so it can succeed a little more slowly if that allows the other request to complete and no other concurrent request causes the same issue. Other alternate mitigation which got discarded: reusing registries when creating new environments (from existing ones) if the database is the same, that works for some case of switching environments, it doesn't work for other where we actually fetch a registry e.g. assets generation calls `get_modules_order` which calls `module_boot` which calls `module_installed_bypass_session` which gets a registry. `get_modules_order` is the last place we know we have an existing registry. Though maybe we could strip out the entire thing and call `module_installed(self.env)` directly? Issue 2581648 closes odoo#73866 X-original-commit: ab84d97 Signed-off-by: Xavier Morel (xmo) <[email protected]>
- Loading branch information