Error during reelection #695

Open

simonbrandhof opened this issue Dec 18, 2024 · 6 comments

@simonbrandhof

The single process that processes River jobs generates this error multiple times a day:

{
    "_sysTime": "2024-12-16T19:04:31.633026767Z",
    "_time": "2024-12-16T19:04:31Z",
    "client_id": "magic-api-worker-1_2024_12_12T19_27_07_561681",
    "err": {
        "error": "error beginning transaction: context deadline exceeded",
        "kind": "*fmt.wrapError"
    },
    "level": "error",
    "message": "Elector: Error attempting reelection",
    "process": "worker",
    "sleep_duration": 1079.326989
}

Jobs are inserted by an insert-only client running in another process. The few jobs (at most 2 per day) are still handled correctly. Postgres is not under heavy load and is on version 15.8.0. The driver is pgx v5.7.1, the connection is standard (no PgBouncer), and River is v0.14.2.

How could I investigate the root cause of this issue?

The code is almost a copy of your tutorial:

workers := river.NewWorkers()
river.AddWorker(workers, myWorker)

logger := slog.New(slogzerolog.Option{Level: slog.LevelInfo, Logger: &log.Logger}.NewZerologHandler())
riverClient, err := river.NewClient(riverpgxv5.New(dbPool), &river.Config{
    Logger:  logger,
    Queues:  map[string]river.QueueConfig{river.QueueDefault: {MaxWorkers: 10}},
    Workers: workers,
})
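
For context, the insert-only client on the other process is set up along these lines; this is only a sketch of the standard River insert-only pattern (MyJobArgs is a hypothetical args type), not the exact code from that process:

package sketch

import (
	"context"

	"github.com/jackc/pgx/v5/pgxpool"
	"github.com/riverqueue/river"
	"github.com/riverqueue/river/riverdriver/riverpgxv5"
)

// MyJobArgs is a hypothetical job args type used only for illustration.
type MyJobArgs struct{}

func (MyJobArgs) Kind() string { return "my_job" }

// insertJob shows an insert-only client: no Queues and no Workers in the
// config, so the client never works jobs, it only inserts them.
func insertJob(ctx context.Context, dbPool *pgxpool.Pool) error {
	insertClient, err := river.NewClient(riverpgxv5.New(dbPool), &river.Config{})
	if err != nil {
		return err
	}

	_, err = insertClient.Insert(ctx, MyJobArgs{}, nil)
	return err
}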

Thanks

@brandur
Contributor

brandur commented Dec 18, 2024

We've had similar reports of this before, but I've always had trouble trying to repro them. I've been running a demo work set for hours and have yet to see it once.

The reelection queries have a context timeout of 5 seconds on them, and it's possible the problem is as simple as that not being enough time for some setups. Is it possible that you have a high-latency link between your app and DB?
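
For reference, what's failing here boils down to beginning a pgx transaction under that deadline, roughly like this sketch (not River's exact code):

package sketch

import (
	"context"
	"fmt"
	"time"

	"github.com/jackc/pgx/v5/pgxpool"
)

// beginWithDeadline sketches what the elector's transaction start amounts to:
// if the pool can't hand out a connection or the round trip to Postgres is
// slow, Begin fails with "context deadline exceeded" once the 5s budget is up.
func beginWithDeadline(ctx context.Context, dbPool *pgxpool.Pool) error {
	ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
	defer cancel()

	tx, err := dbPool.Begin(ctx)
	if err != nil {
		return fmt.Errorf("error beginning transaction: %w", err)
	}
	defer tx.Rollback(ctx)

	// ... LeaderAttemptReelect would run here ...
	return tx.Commit(ctx)
}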

Aside from that, any chance you have a slow query log or some other similar insight enabled in Postgres?

https://www.metisdata.io/blog/how-enabling-slow-query-log-enhances-postgres-observability

If you could look for the reelection queries and see if they're taking a really long time, that'd give us a path to start investigating (i.e. you could then extract one and EXPLAIN ANALYZE it to get more detail).

The election/reelection queries look like this:

-- name: LeaderAttemptElect :execrows
INSERT INTO river_leader(leader_id, elected_at, expires_at)
    VALUES (@leader_id, now(), now() + @ttl::interval)
ON CONFLICT (name)
    DO NOTHING;

-- name: LeaderAttemptReelect :execrows
INSERT INTO river_leader(leader_id, elected_at, expires_at)
    VALUES (@leader_id, now(), now() + @ttl::interval)
ON CONFLICT (name)
    DO UPDATE SET
        expires_at = now() + @ttl
    WHERE
        river_leader.leader_id = @leader_id;

@simonbrandhof
Author

simonbrandhof commented Dec 19, 2024

Thanks for your help @brandur.

Only the LeaderAttemptReelect query appears in the top 100 of query statistics (pg_stat_statements):

  • 237149 calls
  • 237149 rows
  • total time 20.11 sec
  • mean time 0.08 ms
  • min time 0.05 ms
  • max time 13.88 ms
  • time stddev 0.05 ms
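
These numbers come from pg_stat_statements; roughly this kind of query, sketched here through pgx (column names assume Postgres 13+ and the extension being installed), is enough to pull them:

package sketch

import (
	"context"
	"fmt"

	"github.com/jackc/pgx/v5/pgxpool"
)

// leaderQueryStats pulls the river_leader entries out of pg_stat_statements.
// Times are reported in milliseconds.
func leaderQueryStats(ctx context.Context, dbPool *pgxpool.Pool) error {
	rows, err := dbPool.Query(ctx, `
		SELECT query, calls, mean_exec_time, max_exec_time
		FROM pg_stat_statements
		WHERE query ILIKE '%river_leader%'
		ORDER BY mean_exec_time DESC`)
	if err != nil {
		return err
	}
	defer rows.Close()

	for rows.Next() {
		var query string
		var calls int64
		var meanMS, maxMS float64
		if err := rows.Scan(&query, &calls, &meanMS, &maxMS); err != nil {
			return err
		}
		fmt.Printf("%s\n  calls=%d mean=%.2fms max=%.2fms\n", query, calls, meanMS, maxMS)
	}
	return rows.Err()
}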

Its explain plan is:

Insert on river_leader  (cost=0.00..0.01 rows=0 width=0) (actual time=0.190..0.190 rows=0 loops=1)
  Conflict Resolution: UPDATE
  Conflict Arbiter Indexes: river_leader_pkey
  Conflict Filter: (river_leader.leader_id = 'test'::text)
  Rows Removed by Conflict Filter: 1
  Tuples Inserted: 0
  Conflicting Tuples: 1
  ->  Result  (cost=0.00..0.01 rows=1 width=80) (actual time=0.003..0.003 rows=1 loops=1)
Planning Time: 0.114 ms
Execution Time: 0.236 ms

The plan of LeaderAttemptElect is:

Insert on river_leader  (cost=0.00..0.01 rows=0 width=0) (actual time=0.057..0.057 rows=0 loops=1)
  Conflict Resolution: NOTHING
  Conflict Arbiter Indexes: river_leader_pkey
  Tuples Inserted: 0
  Conflicting Tuples: 1
  ->  Result  (cost=0.00..0.01 rows=1 width=80) (actual time=0.003..0.003 rows=1 loops=1)
Planning Time: 0.075 ms
Execution Time: 0.085 ms

Everything on that side looks correct to me.

@brandur
Contributor

brandur commented Dec 19, 2024

Thanks for the extra detail. Weird ... yeah, that looks fine to me. The total time of 20 sec seems concerning at first glance, but it's the aggregate sum of all execution times, so it's actually fine.

One more: is it possible that you have a fairly slow link between DB and app? Like, are you hosting DB/app across regions or across providers or something of that nature? That could explain how it works a lot of the time, but why an upper bound of 5 seconds might occasionally not be enough.

@simonbrandhof
Author

simonbrandhof commented Dec 19, 2024

I don't think the link between DB and server is the issue. It's fast and reliable (low activity, same datacenter). FYI it's a standard configuration with a "Standard 512M" Postgres on scalingo.com.

The weird thing could be the log field "sleep_duration". I don't know the unit, but values are always between 900 and 2,000.

Note that two new errors occurred today:

{
    "_sysTime": "2024-12-19T08:35:39.395206877Z",
    "_time": "2024-12-19T08:35:38Z",
    "err": {
        "error": "unexpected EOF",
        "kind": "*errors.errorString"
    },
    "level": "error",
    "message": "Notifier: Error from notification wait",
    "process": "worker"
}
{
    "_sysTime": "2024-12-19T08:35:39.452771129Z",
    "_time": "2024-12-19T08:35:38Z",
    "attempt": 0,
    "err": {
        "error": "unexpected EOF",
        "kind": "*errors.errorString"
    },
    "level": "error",
    "message": "Notifier: Error running listener (will attempt reconnect after backoff)",
    "process": "worker",
    "sleep_duration": 506.329371
}

@brandur
Contributor

brandur commented Dec 20, 2024

That seems to indicate that the listener lost its connection, which should be quite a rare occurrence. It's probably the same root cause as whatever's causing the elector trouble on your link.

Regarding sleep duration, here's the code for that:

sleepDuration := serviceutil.ExponentialBackoff(attempt, serviceutil.MaxAttemptsBeforeResetDefault)
n.Logger.ErrorContext(ctx, n.Name+": Error running listener (will attempt reconnect after backoff)",
	"attempt", attempt, "err", err, "sleep_duration", sleepDuration)

Slog's behavior is such that if you're using a text handler, it'll use time.Duration's String(), which produces something like 500ms. If you're using a JSON handler, it gets printed as a number in nanoseconds:

package main

import (
	"log/slog"
	"os"
	"time"
)

func main() {
	dur := 500 * time.Millisecond

	loggerText := slog.New(slog.NewTextHandler(os.Stdout, nil))
	loggerText.Info("with text handler", "sleep_duration", dur)

	loggerJSON := slog.New(slog.NewJSONHandler(os.Stdout, nil))
	loggerJSON.Info("with JSON handler", "sleep_duration", dur)
}
$ go run main.go
time=2024-12-19T21:36:41.948-07:00 level=INFO msg="with text handler" sleep_duration=500ms
{"time":"2024-12-19T21:36:41.949239-07:00","level":"INFO","msg":"with JSONtext handler","sleep_duration":500000000}

You're getting what looks like a milliseconds number as a float, so you must be using some kind of custom slog handler to get that.
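
If you want durations to come out readable regardless of handler, one option is slog's ReplaceAttr hook; this is just a sketch of the general technique, not something River configures for you:

package main

import (
	"log/slog"
	"os"
	"time"
)

func main() {
	// Rewrite any time.Duration attribute into its string form ("500ms") so
	// JSON output stays readable instead of showing raw nanoseconds.
	opts := &slog.HandlerOptions{
		ReplaceAttr: func(groups []string, a slog.Attr) slog.Attr {
			if a.Value.Kind() == slog.KindDuration {
				a.Value = slog.StringValue(a.Value.Duration().String())
			}
			return a
		},
	}

	logger := slog.New(slog.NewJSONHandler(os.Stdout, opts))
	logger.Info("with JSON handler", "sleep_duration", 500*time.Millisecond)
	// {"time":"...","level":"INFO","msg":"with JSON handler","sleep_duration":"500ms"}
}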

brandur added a commit that referenced this issue Dec 21, 2024
While looking into #695 the other day, I was reminded that the handling
of `time.Duration` in things like logging is potentially not very good.

`time.Duration` has a good `String()` override that tends to get used
with text handlers, but for various legacy reasons it doesn't have a
`MarshalJSON` implementation, so when dumped to JSON it gets put in
nanoseconds, which isn't very readable for humans or computers.

This is relevant to us because slog's `JSONHandler` and `TextHandler`
will probably be the most common logging instruments for River. e.g.

    func main() {
        dur := 500 * time.Millisecond

        loggerText := slog.New(slog.NewTextHandler(os.Stdout, nil))
        loggerText.Info("with text handler", "sleep_duration", dur)

        loggerJSON := slog.New(slog.NewJSONHandler(os.Stdout, nil))
        loggerJSON.Info("with JSON handler", "sleep_duration", dur)
    }

    time=2024-12-19T21:36:41.948-07:00 level=INFO msg="with text handler" sleep_duration=500ms
    {"time":"2024-12-19T21:36:41.949239-07:00","level":"INFO","msg":"with JSON handler","sleep_duration":500000000}

Here, change the various places that we log sleep to use a more
definitive value for purposes of the log, in this case a duration string
representation like `10s`.

I debated making this a float instead that'd be represented in seconds
because the float would be more parseable for ingestion engines like
Splunk. I went with the former option in the end though because it's
probably better for humans and most other cases, and we could always add
an alternative `sleep_seconds` field that's a float later if someone
asks for it.
@emilecaron

We had a hard time with the same error messages on our side. It turned out to be starvation of the pgx pool's connections (because of a crappy worker on our side), leaving River unable to send any query:

time=2025-02-05T14:12:05.277Z level=ERROR msg="Elector: Error attempting reelection" client_id=c2b93b2b3676_2025_02_05T14_07_32_004429 err="error beginning transaction: context deadline exceeded" sleep_duration=905.18934ms
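
If you suspect the same thing, one way to confirm it is to log pool stats periodically and watch for the pool pinned at its maximum while acquires queue up. A sketch (not something River ships):

package sketch

import (
	"context"
	"log/slog"
	"time"

	"github.com/jackc/pgx/v5/pgxpool"
)

// logPoolStats periodically dumps pgxpool stats. Starvation shows up as
// AcquiredConns stuck at MaxConns with EmptyAcquireCount climbing.
func logPoolStats(ctx context.Context, dbPool *pgxpool.Pool, logger *slog.Logger) {
	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			s := dbPool.Stat()
			logger.Info("pgx pool stats",
				"acquired_conns", s.AcquiredConns(),
				"idle_conns", s.IdleConns(),
				"max_conns", s.MaxConns(),
				"empty_acquire_count", s.EmptyAcquireCount(),
			)
		}
	}
}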

Hope this helps
