Error during reelection #695
We've had similar reports of this before, but I've always had trouble trying to repro them. I've been running a demo work set for hours and have yet to see it once. The reelection queries have a context timeout of 5 seconds on them, and it's possible that the problem is as simple as that: for some setups, that's just not enough time. Is it possible that you have a high latency link between your app and DB?

Aside from that, any chance you have a slow query log or some other similar insight enabled in Postgres? https://www.metisdata.io/blog/how-enabling-slow-query-log-enhances-postgres-observability

If you could look for the reelection queries and see if they're taking a really long time, that'd give us a path to start investigating (i.e. you could then extract one and dig into it). The election/reelection queries look like this:

```sql
-- name: LeaderAttemptElect :execrows
INSERT INTO river_leader(leader_id, elected_at, expires_at)
VALUES (@leader_id, now(), now() + @ttl::interval)
ON CONFLICT (name)
DO NOTHING;
-- name: LeaderAttemptReelect :execrows
INSERT INTO river_leader(leader_id, elected_at, expires_at)
VALUES (@leader_id, now(), now() + @ttl::interval)
ON CONFLICT (name)
DO UPDATE SET
expires_at = now() + @ttl
WHERE
river_leader.leader_id = @leader_id;
```
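If the slow query log does surface one of these, a quick follow-up is to run the same statement by hand from the app host and time the round trip. A rough sketch, assuming a pgx v5 connection, a `DATABASE_URL` environment variable, and the `river_leader` schema implied by the queries above; the leader ID and TTL here are made up:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"os"
	"time"

	"github.com/jackc/pgx/v5"
)

func main() {
	ctx := context.Background()

	conn, err := pgx.Connect(ctx, os.Getenv("DATABASE_URL"))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close(ctx)

	// Same shape as LeaderAttemptReelect above, with a hardcoded TTL and a
	// made-up leader ID so it won't touch a live elector's row; the point is
	// only to see how long one round trip of this statement takes.
	const reelect = `
		INSERT INTO river_leader(leader_id, elected_at, expires_at)
		VALUES ($1, now(), now() + interval '10 seconds')
		ON CONFLICT (name)
		DO UPDATE SET expires_at = now() + interval '10 seconds'
		WHERE river_leader.leader_id = $1`

	start := time.Now()
	tag, err := conn.Exec(ctx, reelect, "manual-test-leader")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("rows affected: %d, elapsed: %s\n", tag.RowsAffected(), time.Since(start))
}
```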
Thanks for your help @brandur. Only the request [query screenshot not captured] shows up there.

Its explain plan is: [plan screenshot not captured]

The plan of [second query; plan screenshot not captured]

Everything seems correct to me on that part.
Thanks for the extra detail. Weird ... yeah, that looks fine to me. The total time of 20 sec seems concerning, but that's the aggregate sum of all execution times, so that's actually fine.

One more: is it possible that you have a fairly slow link between DB and app? Like, are you hosting DB/app across regions or across providers or something of that nature? That could explain how it works a lot of the time, but where an upper bound of 5 seconds might not be enough.
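One cheap way to rule the link in or out is to time a few trivial round trips from the app host to the database. A minimal sketch, assuming pgx v5 and a `DATABASE_URL` environment variable:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"os"
	"time"

	"github.com/jackc/pgx/v5"
)

func main() {
	ctx := context.Background()

	conn, err := pgx.Connect(ctx, os.Getenv("DATABASE_URL"))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close(ctx)

	// Round-trip a trivial query a few times; if these take hundreds of
	// milliseconds, a 5 second budget for the election queries gets tight.
	for i := 0; i < 5; i++ {
		start := time.Now()
		if _, err := conn.Exec(ctx, "SELECT 1"); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("round trip %d: %s\n", i+1, time.Since(start))
	}
}
```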
I don't think the link between DB and app server is the issue. It's fast and reliable (low activity, same datacenter). FYI it's a standard configuration with a "Standard 512M" Postgres on scalingo.com.

The weird thing could be the log field "sleep_duration". I don't know the unit, but values are always between 900 and 2,000.

Note that two new errors occurred today: [error logs not captured]
That seems to indicate that the listener lost its connection, which should be quite a rare occurrence. It's probably the same root cause as whatever's causing the trouble on your link with the elector.

Regarding sleep duration, here's the code for that:

```go
sleepDuration := serviceutil.ExponentialBackoff(attempt, serviceutil.MaxAttemptsBeforeResetDefault)

n.Logger.ErrorContext(ctx, n.Name+": Error running listener (will attempt reconnect after backoff)",
	"attempt", attempt, "err", err, "sleep_duration", sleepDuration)
```

Slog's behavior is such that if you're using a text handler, it'll use `time.Duration`'s `String()` representation, while a JSON handler dumps the raw nanoseconds:

```go
package main

import (
	"log/slog"
	"os"
	"time"
)

func main() {
	dur := 500 * time.Millisecond

	loggerText := slog.New(slog.NewTextHandler(os.Stdout, nil))
	loggerText.Info("with text handler", "sleep_duration", dur)

	loggerJSON := slog.New(slog.NewJSONHandler(os.Stdout, nil))
	loggerJSON.Info("with JSON handler", "sleep_duration", dur)
}
```

```
$ go run main.go
time=2024-12-19T21:36:41.948-07:00 level=INFO msg="with text handler" sleep_duration=500ms
{"time":"2024-12-19T21:36:41.949239-07:00","level":"INFO","msg":"with JSON handler","sleep_duration":500000000}
```

You're getting what looks like a milliseconds number as a float, so you must be using some kind of custom slog handler to get that.
While looking into #695 the other day, I was reminded that the handling of `time.Duration` in things like logging is potentially not very good. `time.Duration` has a good `String()` override that tends to get used with text handlers, but for various legacy reasons it doesn't have a `MarshalJSON` implementation, so when dumped to JSON it gets put in nanoseconds, which isn't very readable for human or computer. This is relevant to us because slog's `JSONHandler` and `TextHandler` will probably be the most common logging instruments for River. e.g.

```go
func main() {
	dur := 500 * time.Millisecond

	loggerText := slog.New(slog.NewTextHandler(os.Stdout, nil))
	loggerText.Info("with text handler", "sleep_duration", dur)

	loggerJSON := slog.New(slog.NewJSONHandler(os.Stdout, nil))
	loggerJSON.Info("with JSON handler", "sleep_duration", dur)
}
```

```
time=2024-12-19T21:36:41.948-07:00 level=INFO msg="with text handler" sleep_duration=500ms
{"time":"2024-12-19T21:36:41.949239-07:00","level":"INFO","msg":"with JSON handler","sleep_duration":500000000}
```

Here, change the various places where we log sleep to use a more definitive value for purposes of the log, in this case a duration string representation like `10s`. I debated making this a float instead that'd be represented in seconds because the float would be more parseable for ingestion engines like Splunk. I went with the former option in the end though because it's probably better for humans and most other cases, and we could always add an alternative `sleep_seconds` field that's a float later if someone asks for it.
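A minimal sketch of what that change amounts to at the logging call sites, passing the `String()` form rather than the raw `time.Duration`; the field name and value here are illustrative:

```go
package main

import (
	"log/slog"
	"os"
	"time"
)

func main() {
	sleepDuration := 10 * time.Second

	loggerText := slog.New(slog.NewTextHandler(os.Stdout, nil))
	loggerJSON := slog.New(slog.NewJSONHandler(os.Stdout, nil))

	// Logging the String() form yields a readable "10s" from both handlers,
	// instead of raw nanoseconds from the JSON one.
	loggerText.Info("sleeping before retry", "sleep_duration", sleepDuration.String())
	loggerJSON.Info("sleeping before retry", "sleep_duration", sleepDuration.String())
}
```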
We had a hard time with the same error messages on our side. It turned out to be starvation of the pgx pool connections (because of a crappy worker on our side), leaving River unable to send any query: [error logs not captured]

Hope this helps
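If pool starvation is a suspect, pgxpool's built-in counters make it easy to check. A small sketch, assuming pgx v5's `pgxpool` and a `DATABASE_URL` environment variable; the polling interval is arbitrary:

```go
package main

import (
	"context"
	"log"
	"os"
	"time"

	"github.com/jackc/pgx/v5/pgxpool"
)

func main() {
	ctx := context.Background()

	pool, err := pgxpool.New(ctx, os.Getenv("DATABASE_URL"))
	if err != nil {
		log.Fatal(err)
	}
	defer pool.Close()

	// Periodically dump pool statistics. If AcquiredConns stays pinned at
	// MaxConns and EmptyAcquireCount keeps growing, workers are starving the
	// pool and River's elector/listener queries will stall waiting for a
	// connection.
	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()
	for range ticker.C {
		s := pool.Stat()
		log.Printf("pool: max=%d total=%d acquired=%d idle=%d empty_acquires=%d",
			s.MaxConns(), s.TotalConns(), s.AcquiredConns(), s.IdleConns(), s.EmptyAcquireCount())
	}
}
```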
The single process that processes River jobs generates this error multiple times a day: [error log not captured]

Jobs are inserted by an insert-only client from another process. The few jobs (at most 2 per day) are still handled correctly. Postgres is not under heavy load and is on version 15.8.0. The driver is pgx v5.7.1, the connection is quite standard (no PgBouncer), and River is v0.14.2.
How could I investigate the root cause of this issue?
The code is almost a duplication of your tutorial:
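For reference, the getting-started setup that tutorial describes looks roughly like the sketch below; the job args, worker, and `DATABASE_URL` here are illustrative rather than the reporter's actual code:

```go
package main

import (
	"context"
	"log"
	"os"

	"github.com/jackc/pgx/v5/pgxpool"
	"github.com/riverqueue/river"
	"github.com/riverqueue/river/riverdriver/riverpgxv5"
)

type SortArgs struct {
	Strings []string `json:"strings"`
}

func (SortArgs) Kind() string { return "sort" }

type SortWorker struct {
	river.WorkerDefaults[SortArgs]
}

func (w *SortWorker) Work(ctx context.Context, job *river.Job[SortArgs]) error {
	// ... job logic goes here ...
	return nil
}

func main() {
	ctx := context.Background()

	dbPool, err := pgxpool.New(ctx, os.Getenv("DATABASE_URL"))
	if err != nil {
		log.Fatal(err)
	}
	defer dbPool.Close()

	workers := river.NewWorkers()
	river.AddWorker(workers, &SortWorker{})

	riverClient, err := river.NewClient(riverpgxv5.New(dbPool), &river.Config{
		Queues: map[string]river.QueueConfig{
			river.QueueDefault: {MaxWorkers: 100},
		},
		Workers: workers,
	})
	if err != nil {
		log.Fatal(err)
	}

	if err := riverClient.Start(ctx); err != nil {
		log.Fatal(err)
	}
	// ... wait for a shutdown signal, then riverClient.Stop(ctx) ...
}
```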
Thanks