Error during reelection #695

Open

simonbrandhof opened this issue Dec 18, 2024 · 6 comments

@simonbrandhof

The single process that processes River jobs generates this error multiple times a day:

{
    "_sysTime": "2024-12-16T19:04:31.633026767Z",
    "_time": "2024-12-16T19:04:31Z",
    "client_id": "magic-api-worker-1_2024_12_12T19_27_07_561681",
    "err": {
        "error": "error beginning transaction: context deadline exceeded",
        "kind": "*fmt.wrapError"
    },
    "level": "error",
    "message": "Elector: Error attempting reelection",
    "process": "worker",
    "sleep_duration": 1079.326989
}

Jobs are inserted by an insert-only client running in another process. The few jobs (at most 2 per day) are still handled correctly. Postgres is not under heavy load and is on version 15.8.0. The driver is pgx v5.7.1, the connection is standard (no PgBouncer), and River is v0.14.2.

How could I investigate the root cause of this issue?

The code is almost a copy of your tutorial:

workers := river.NewWorkers()
river.AddWorker(workers, myWorker)

logger := slog.New(slogzerolog.Option{Level: slog.LevelInfo, Logger: &log.Logger}.NewZerologHandler())
riverClient, err := river.NewClient(riverpgxv5.New(dbPool), &river.Config{
    Logger:  logger,
    Queues:  map[string]river.QueueConfig{river.QueueDefault: {MaxWorkers: 10}},
    Workers: workers,
})
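
For context, the insert-only client on the other process is set up along these lines; this is only a sketch of the standard River insert-only pattern (MyJobArgs is a hypothetical args type), not the exact code from that process:

package sketch

import (
	"context"

	"github.com/jackc/pgx/v5/pgxpool"
	"github.com/riverqueue/river"
	"github.com/riverqueue/river/riverdriver/riverpgxv5"
)

// MyJobArgs is a hypothetical job args type used only for illustration.
type MyJobArgs struct{}

func (MyJobArgs) Kind() string { return "my_job" }

// insertJob shows an insert-only client: no Queues and no Workers in the
// config, so the client never works jobs, it only inserts them.
func insertJob(ctx context.Context, dbPool *pgxpool.Pool) error {
	insertClient, err := river.NewClient(riverpgxv5.New(dbPool), &river.Config{})
	if err != nil {
		return err
	}

	_, err = insertClient.Insert(ctx, MyJobArgs{}, nil)
	return err
}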

Thanks

@brandur
Contributor

brandur commented Dec 18, 2024

We've had similar reports of this before, but I've always had trouble trying to repro them. I've been running a demo work set for hours and have yet to see it once.

The reelection queries have a context timeout of 5 seconds on them, and it's possible the problem is as simple as that not being enough time for some setups. Is it possible that you have a high-latency link between your app and DB?
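
For reference, what's failing here boils down to beginning a pgx transaction under that deadline, roughly like this sketch (not River's exact code):

package sketch

import (
	"context"
	"fmt"
	"time"

	"github.com/jackc/pgx/v5/pgxpool"
)

// beginWithDeadline sketches what the elector's transaction start amounts to:
// if the pool can't hand out a connection or the round trip to Postgres is
// slow, Begin fails with "context deadline exceeded" once the 5s budget is up.
func beginWithDeadline(ctx context.Context, dbPool *pgxpool.Pool) error {
	ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
	defer cancel()

	tx, err := dbPool.Begin(ctx)
	if err != nil {
		return fmt.Errorf("error beginning transaction: %w", err)
	}
	defer tx.Rollback(ctx)

	// ... LeaderAttemptReelect would run here ...
	return tx.Commit(ctx)
}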

Aside from that, any chance you have a slow query log or some other similar insight enabled in Postgres?

https://www.metisdata.io/blog/how-enabling-slow-query-log-enhances-postgres-observability

If you could look for the reelection queries and see if they're taking a really long time, that'd give us a path to start investigating (i.e. you could then extract one and EXPLAIN ANALYZE it to get more detail).

The election/reelection queries look like this:

-- name: LeaderAttemptElect :execrows
INSERT INTO river_leader(leader_id, elected_at, expires_at)
    VALUES (@leader_id, now(), now() + @ttl::interval)
ON CONFLICT (name)
    DO NOTHING;

-- name: LeaderAttemptReelect :execrows
INSERT INTO river_leader(leader_id, elected_at, expires_at)
    VALUES (@leader_id, now(), now() + @ttl::interval)
ON CONFLICT (name)
    DO UPDATE SET
        expires_at = now() + @ttl
    WHERE
        river_leader.leader_id = @leader_id;

@simonbrandhof
Author

simonbrandhof commented Dec 19, 2024

Thanks for your help @brandur.

Only the LeaderAttemptReelect query appears in the top 100 of query statistics (pg_stat_statements):

  • 237149 calls
  • 237149 rows
  • total time 20.11 sec
  • mean time 0.08 ms
  • min time 0.05 ms
  • max time 13.88 ms
  • time stddev 0.05 ms
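
These numbers come from pg_stat_statements; roughly this kind of query, sketched here through pgx (column names assume Postgres 13+ and the extension being installed), is enough to pull them:

package sketch

import (
	"context"
	"fmt"

	"github.com/jackc/pgx/v5/pgxpool"
)

// leaderQueryStats pulls the river_leader entries out of pg_stat_statements.
// Times are reported in milliseconds.
func leaderQueryStats(ctx context.Context, dbPool *pgxpool.Pool) error {
	rows, err := dbPool.Query(ctx, `
		SELECT query, calls, mean_exec_time, max_exec_time
		FROM pg_stat_statements
		WHERE query ILIKE '%river_leader%'
		ORDER BY mean_exec_time DESC`)
	if err != nil {
		return err
	}
	defer rows.Close()

	for rows.Next() {
		var query string
		var calls int64
		var meanMS, maxMS float64
		if err := rows.Scan(&query, &calls, &meanMS, &maxMS); err != nil {
			return err
		}
		fmt.Printf("%s\n  calls=%d mean=%.2fms max=%.2fms\n", query, calls, meanMS, maxMS)
	}
	return rows.Err()
}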

Its explain plan is:

Insert on river_leader  (cost=0.00..0.01 rows=0 width=0) (actual time=0.190..0.190 rows=0 loops=1)
  Conflict Resolution: UPDATE
  Conflict Arbiter Indexes: river_leader_pkey
  Conflict Filter: (river_leader.leader_id = 'test'::text)
  Rows Removed by Conflict Filter: 1
  Tuples Inserted: 0
  Conflicting Tuples: 1
  ->  Result  (cost=0.00..0.01 rows=1 width=80) (actual time=0.003..0.003 rows=1 loops=1)
Planning Time: 0.114 ms
Execution Time: 0.236 ms

The plan of LeaderAttemptElect is:

Insert on river_leader  (cost=0.00..0.01 rows=0 width=0) (actual time=0.057..0.057 rows=0 loops=1)
  Conflict Resolution: NOTHING
  Conflict Arbiter Indexes: river_leader_pkey
  Tuples Inserted: 0
  Conflicting Tuples: 1
  ->  Result  (cost=0.00..0.01 rows=1 width=80) (actual time=0.003..0.003 rows=1 loops=1)
Planning Time: 0.075 ms
Execution Time: 0.085 ms

Everything on that side looks correct to me.

@brandur
Contributor

brandur commented Dec 19, 2024

Thanks for the extra detail. Weird ... yeah, that looks fine to me. The total time of 20 sec seems concerning at first glance, but it's the aggregate sum of all execution times, so it's actually fine.

One more: is it possible that you have a fairly slow link between DB and app? Like, are you hosting DB/app across regions or across providers or something of that nature? That could explain how it works a lot of the time, but why an upper bound of 5 seconds might occasionally not be enough.

@simonbrandhof
Author

simonbrandhof commented Dec 19, 2024

I don't think the link between DB and server is the issue. It's fast and reliable (low activity, same datacenter). FYI it's a standard configuration with a "Standard 512M" Postgres on scalingo.com.

The weird thing could be the log field "sleep_duration". I don't know the unit, but values are always between 900 and 2,000.

Note that two new errors occurred today:

{
    "_sysTime": "2024-12-19T08:35:39.395206877Z",
    "_time": "2024-12-19T08:35:38Z",
    "err": {
        "error": "unexpected EOF",
        "kind": "*errors.errorString"
    },
    "level": "error",
    "message": "Notifier: Error from notification wait",
    "process": "worker"
}
{
    "_sysTime": "2024-12-19T08:35:39.452771129Z",
    "_time": "2024-12-19T08:35:38Z",
    "attempt": 0,
    "err": {
        "error": "unexpected EOF",
        "kind": "*errors.errorString"
    },
    "level": "error",
    "message": "Notifier: Error running listener (will attempt reconnect after backoff)",
    "process": "worker",
    "sleep_duration": 506.329371
}

@brandur
Contributor

brandur commented Dec 20, 2024

That seems to indicate that the listener lost its connection, which should be quite a rare occurrence. It's probably the same root cause as whatever's causing the elector trouble on your link.

Regarding sleep duration, here's the code for that:

sleepDuration := serviceutil.ExponentialBackoff(attempt, serviceutil.MaxAttemptsBeforeResetDefault)
n.Logger.ErrorContext(ctx, n.Name+": Error running listener (will attempt reconnect after backoff)",
	"attempt", attempt, "err", err, "sleep_duration", sleepDuration)

Slog's behavior is such that if you're using a text handler, it'll use time.Duration's String(), which produces something like 500ms. If you're using a JSON handler, it gets printed as a number in nanoseconds:

package main

import (
	"log/slog"
	"os"
	"time"
)

func main() {
	dur := 500 * time.Millisecond

	loggerText := slog.New(slog.NewTextHandler(os.Stdout, nil))
	loggerText.Info("with text handler", "sleep_duration", dur)

	loggerJSON := slog.New(slog.NewJSONHandler(os.Stdout, nil))
	loggerJSON.Info("with JSON handler", "sleep_duration", dur)
}
$ go run main.go
time=2024-12-19T21:36:41.948-07:00 level=INFO msg="with text handler" sleep_duration=500ms
{"time":"2024-12-19T21:36:41.949239-07:00","level":"INFO","msg":"with JSONtext handler","sleep_duration":500000000}

You're getting what looks like a milliseconds number as a float, so you must be using some kind of custom slog handler to get that.
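
If you want durations to come out readable regardless of handler, one option is slog's ReplaceAttr hook; this is just a sketch of the general technique, not something River configures for you:

package main

import (
	"log/slog"
	"os"
	"time"
)

func main() {
	// Rewrite any time.Duration attribute into its string form ("500ms") so
	// JSON output stays readable instead of showing raw nanoseconds.
	opts := &slog.HandlerOptions{
		ReplaceAttr: func(groups []string, a slog.Attr) slog.Attr {
			if a.Value.Kind() == slog.KindDuration {
				a.Value = slog.StringValue(a.Value.Duration().String())
			}
			return a
		},
	}

	logger := slog.New(slog.NewJSONHandler(os.Stdout, opts))
	logger.Info("with JSON handler", "sleep_duration", 500*time.Millisecond)
	// {"time":"...","level":"INFO","msg":"with JSON handler","sleep_duration":"500ms"}
}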

brandur added a commit that referenced this issue Dec 21, 2024
While looking into #695 the other day, I was reminded that the handling
of `time.Duration` in things like logging is potentially not very good.

`time.Duration` has a good `String()` override that tends to get used
with text handlers, but for various legacy reasons it doesn't have a
`MarshalJSON` implementation, so when dumped to JSON it gets put in
nanoseconds, which isn't very readable for humans or computers.

This is relevant to us because slog's `JSONHandler` and `TextHandler`
will probably be the most common logging instruments for River. e.g.

    func main() {
        dur := 500 * time.Millisecond

        loggerText := slog.New(slog.NewTextHandler(os.Stdout, nil))
        loggerText.Info("with text handler", "sleep_duration", dur)

        loggerJSON := slog.New(slog.NewJSONHandler(os.Stdout, nil))
        loggerJSON.Info("with JSON handler", "sleep_duration", dur)
    }

    time=2024-12-19T21:36:41.948-07:00 level=INFO msg="with text handler" sleep_duration=500ms
    {"time":"2024-12-19T21:36:41.949239-07:00","level":"INFO","msg":"with JSON handler","sleep_duration":500000000}

Here, change the various places that we log sleep to use a more
definitive value for purposes of the log, in this case a duration string
representation like `10s`.

I debated making this a float instead that'd be represented in seconds
because the float would be more parseable for ingestion engines like
Splunk. I went with the former option in the end though because it's
probably better for humans and most other cases, and we could always add
an alternative `sleep_seconds` field that's a float later if someone
asks for it.
@emilecaron

We had a hard time with the same error messages on our side. It turned out to be starvation of the pgx pool's connections (because of a crappy worker on our side), leaving River unable to send any query:

time=2025-02-05T14:12:05.277Z level=ERROR msg="Elector: Error attempting reelection" client_id=c2b93b2b3676_2025_02_05T14_07_32_004429 err="error beginning transaction: context deadline exceeded" sleep_duration=905.18934ms
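
If you suspect the same thing, one way to confirm it is to log pool stats periodically and watch for the pool pinned at its maximum while acquires queue up. A sketch (not something River ships):

package sketch

import (
	"context"
	"log/slog"
	"time"

	"github.com/jackc/pgx/v5/pgxpool"
)

// logPoolStats periodically dumps pgxpool stats. Starvation shows up as
// AcquiredConns stuck at MaxConns with EmptyAcquireCount climbing.
func logPoolStats(ctx context.Context, dbPool *pgxpool.Pool, logger *slog.Logger) {
	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			s := dbPool.Stat()
			logger.Info("pgx pool stats",
				"acquired_conns", s.AcquiredConns(),
				"idle_conns", s.IdleConns(),
				"max_conns", s.MaxConns(),
				"empty_acquire_count", s.EmptyAcquireCount(),
			)
		}
	}
}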

Hope this helps
