Fix sticky decision keeps failing on ConditionFailedError (cadence-workflow#3423)

When the rangeID changes and a sticky task hits ConditionFailedError, the sticky tasklist can get into a bad state where no task can be added to it because of an endless ConditionFailedError loop. This increases workflow latency, because every sticky decision task times out (the workflow can then only progress through the normal tasklist).
Once this happens, the only way to recover is for the user to restart the workers that poll the affected tasklists.

We want to stabilize the tasklist rangeID so that workers do not need to restart to recover use of the sticky tasklist.
This PR takes a relatively conservative approach to stop the infinite retrying of sticky failures on ConditionFailedError. By reducing concurrent access to the tasklist, it becomes more likely that the endless ConditionFailedError loop is broken.
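
As a rough, standalone sketch of the retry-breaking idea (the error value, retry limit, and function names below are hypothetical simplifications; the real changes are in the diff further down):

package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-ins for the real types and config in the diff below:
// errConditionFailed plays the role of the persistence ConditionFailedError,
// and maxRetryCount plays the role of the configured retry limit.
var errConditionFailed = errors.New("ConditionFailedError")

const maxRetryCount = 3

func handleErr(attempt int, err error) error {
	// Conservative break: once past the retry limit, drop the sticky task
	// instead of retrying the condition failure forever.
	if attempt > maxRetryCount && errors.Is(err, errConditionFailed) {
		return nil
	}
	return err // otherwise surface the error so the task is retried
}

func main() {
	for attempt := 1; attempt <= 5; attempt++ {
		if handleErr(attempt, errConditionFailed) == nil {
			fmt.Println("dropped sticky task at attempt", attempt) // prints 4
			return
		}
	}
}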

Another approach would be to clear stickiness on the worker side, but that would involve bigger changes and risks.
vancexu authored Aug 4, 2020
1 parent dee9745 commit db444cf
Showing 5 changed files with 29 additions and 1 deletion.
3 changes: 3 additions & 0 deletions common/constants.go
@@ -165,3 +165,6 @@ const (
 	// TaskTypeReplication is the task type for replication task
 	TaskTypeReplication
 )
+
+// StickyTaskConditionFailedErrorMsg is the error message for a sticky task ConditionFailedError
+const StickyTaskConditionFailedErrorMsg = "StickyTaskConditionFailedError"
8 changes: 8 additions & 0 deletions common/util.go
@@ -783,3 +783,11 @@ func ConvertDynamicConfigMapPropertyToIntMap(
 	}
 	return intMap, nil
 }
+
+// IsStickyTaskConditionError checks whether an error is the sticky task ConditionFailedError returned by the matching engine
+func IsStickyTaskConditionError(err error) bool {
+	if e, ok := err.(*workflow.InternalServiceError); ok {
+		return e.GetMessage() == StickyTaskConditionFailedErrorMsg
+	}
+	return false
+}
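
For illustration, here is a standalone usage sketch of the new helper; the InternalServiceError type below is a local stand-in for workflow.InternalServiceError from the generated shared package (an assumption made so the example is self-contained):

package main

import "fmt"

const StickyTaskConditionFailedErrorMsg = "StickyTaskConditionFailedError"

// InternalServiceError is a local stand-in for workflow.InternalServiceError,
// inlined here so the sketch compiles and runs on its own.
type InternalServiceError struct{ Message string }

func (e *InternalServiceError) Error() string      { return e.Message }
func (e *InternalServiceError) GetMessage() string { return e.Message }

func IsStickyTaskConditionError(err error) bool {
	if e, ok := err.(*InternalServiceError); ok {
		return e.GetMessage() == StickyTaskConditionFailedErrorMsg
	}
	return false
}

func main() {
	sticky := &InternalServiceError{Message: StickyTaskConditionFailedErrorMsg}
	fmt.Println(IsStickyTaskConditionError(sticky))             // true
	fmt.Println(IsStickyTaskConditionError(fmt.Errorf("boom"))) // false
}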
8 changes: 8 additions & 0 deletions service/history/task/task.go
@@ -26,6 +26,7 @@ import (
 	"time"
 
 	workflow "github.com/uber/cadence/.gen/go/shared"
+	"github.com/uber/cadence/common"
 	"github.com/uber/cadence/common/cache"
 	"github.com/uber/cadence/common/clock"
 	"github.com/uber/cadence/common/log"
@@ -331,6 +332,13 @@ func (t *taskBase) HandleErr(
 		return nil
 	}
 
+	if t.GetAttempt() > t.maxRetryCount() && common.IsStickyTaskConditionError(err) {
+		// a sticky task can end up in an endless retry loop in rare cases,
+		// causing the worker to keep hitting decision timeouts until it restarts.
+		// return nil here to break the endless loop
+		return nil
+	}
+
 	t.logger.Error("Fail to process task", tag.Error(err), tag.LifeCycleProcessingFailed)
 	return err
 }
7 changes: 7 additions & 0 deletions service/history/taskProcessor.go
@@ -322,6 +322,13 @@ func (t *taskProcessor) handleTaskError(
 		return nil
 	}
 
+	if taskInfo.attempt > t.config.TimerTaskMaxRetryCount() && common.IsStickyTaskConditionError(err) {
+		// a sticky task can end up in an endless retry loop in rare cases,
+		// causing the worker to keep hitting decision timeouts until it restarts.
+		// return nil here to break the endless loop
+		return nil
+	}
+
 	taskInfo.logger.Error("Fail to process task", tag.Error(err), tag.LifeCycleProcessingFailed)
 	return err
 }
4 changes: 3 additions & 1 deletion service/matching/taskListManager.go
@@ -205,7 +205,6 @@ func (c *taskListManagerImpl) Stop() {
 	c.taskWriter.Stop()
 	c.taskReader.Stop()
 	c.engine.removeTaskListManager(c.taskListID)
-	c.engine.removeTaskListManager(c.taskListID)
 	c.logger.Info("", tag.LifeCycleStopped)
 }

@@ -479,6 +478,9 @@ func (c *taskListManagerImpl) executeWithRetry(
 		c.metricScope().IncCounter(metrics.ConditionFailedErrorPerTaskListCounter)
 		c.logger.Debug(fmt.Sprintf("Stopping task list due to persistence condition failure. Err: %v", err))
 		c.Stop()
+		if c.taskListKind == s.TaskListKindSticky {
+			err = &s.InternalServiceError{Message: common.StickyTaskConditionFailedErrorMsg}
+		}
 	}
 	return
 }
